ABSTRACT

The problem of enhancement and bandwidth compression
of noisy speech is formulated as a parameter estimation
problem, in which speech and its model parameters are
estimated from the noisy speech based on the MAP estimation
procedure. Such an approach leads to two algorithms
which require solving sets of linear equations in an
iterative manner. Some approximations of the two algorithms
lead to two systems which are computationally simpler
than the two algorithms by taking advantage of a high
speed FFT algorithm. As a preliminary investigation into
the performance of the class of systems developed, two
systems are implemented and applied to both real and
synthetic speech data. An objective and an informal
subjective evaluation indicate that the systems implemented
perform well as enhancement and potential bandwidth
compression systems for noisy speech.
CHAPTER I INTRODUCTION
1.1  Introduction 

Degradation  of  speech  by  additive  noise  occurs  in 
a  number  of  practical  situations.  For  example,  the  speech 
of  a  pilot  in  a  plane  communicating  with  the  ground  control 
is  degraded  by  the  airplane  noise.  Another  example  is 
the  speech  of  a  lecturer  recorded  in  a  noisy  lecture  hall. 
The  corrupting  noise  generally  reduces  [1]  both  the 
intelligibility  and  the  quality  of  speech.  Furthermore, 
the performance of many narrow-band communication systems
degrades  quickly  [2,3]  as  the  speech  to  noise  ratio 
decreases.  Thus,  techniques  for  enhancement  and  bandwidth 
compression  of  noisy  speech  have  a  variety  of  applications. 

In  developing  systems  for  speech  enhancement,  an 
important task is defining the goal of speech enhancement.
A clear definition of this goal can potentially provide
an objective criterion on the basis of which speech enhancement
systems can be developed. Such a goal also provides
a criterion for evaluating the performance of a system for
the particular application under consideration. In
general, speech enhancement implies a subjective improvement
of the speech such as increased intelligibility and
quality, reduced listener fatigue, etc. It is important
to note that the subjective improvement, even though
related, is not necessarily the same as the speech to
noise ratio increase. For example, a speech processing
system which eliminates unvoiced segments and low-pass
filters  voiced  segments  of  speech  degraded  by  wide  band 
additive  noise  may  increase  the  overall  S/N  ratio  but 
probably  is  not  a  speech  enhancement  system  in  most 
practical  applications. 

Another  important  aspect  of  developing  a  speech 
enhancement  system  is  to  accurately  assess  what  information 
can  be  assumed  about  the  speech  and  the  background 
noise. Given a noisy speech signal with no assumptions
about the speech or noise, there is little that can be done
to enhance the speech signal. A general rule for any
problem requiring the separation of individual signal
components (combined by addition, convolution, etc.) is
that the more we know about each component, the better
we can solve the problem. Depending on the nature of the
corrupting noise, some information about the noise may be
obtained from the knowledge of the source, or from actual
measurements. About speech, a great deal is known from
the vast research efforts in the general area of speech
communications. We know a great deal about the human
speech production mechanism and also have some understanding
of the human perception of speech. In principle,
we can attempt to incorporate everything we know about
speech in developing a speech enhancement system. However,
some of our knowledge is qualitative or complicated
and its incorporation into such a system may be very
difficult. For example, human speech has linguistic
constraints  imposed  by  the  rules  of  the  language.  But 
to  incorporate  such  knowledge  in  a  system  for  speech 
enhancement  is  probably  a  difficult  task.  Thus,  the 
extent  of  our  knowledge  of  speech  that  can  be  incorporated 
is  limited  by  our  capability  to  develop  and  implement 
systems  that  can  exploit  such  available  knowledge. 

In  developing  a  speech  enhancement  system,  two 
different approaches can be taken. One is the "noise
removal" approach in which a system is developed to eliminate
as much background noise as possible with as little
speech degradation as possible. The other approach is
the "reconstruction" approach in which the speech parameters
sufficient for reconstruction are estimated and then
speech is reconstructed based on the estimated parameters.
Which  approach  is  better  for  speech  enhancement  depends 
on  many  factors  such  as  how  much  we  know  about  speech. 
However,  for  relatively  high  S/N  ratios,  it  is  expected 
that  the  noise  reduction  approach  is  better  than  the 
reconstruction  approach  since  the  latter  generally 
changes  the  input  speech. 

Independent  of  which  approach  is  taken,  the  essence 
of a speech enhancement system is an algorithm that incorporates,
in some optimum manner, as much as possible of
what we know about speech and the background noise. The
optimality condition, ideally, should be based on the
specific goal of speech enhancement. In general, such
a  condition  is  unknown  or  quite  complicated  since  a 
subjective  quantity  such  as  speech  intelligibility  can  not 
easily  be  related  to  a  measurable  physical  quantity  that 
may  be  used  as  a  criterion  for  optimality.  In  the 
absence  of  such  a  criterion  or  if  the  resulting  system 
becomes  highly  complex  even  in  the  presence  of  such  a 
criterion,  we  may  consider  a  suboptimal  procedure  or 
define  the  optimal  condition  to  be  optimum  in  a  different 
sense  such  as  the  maximum  likelihood  sense. 

Once we have formulated an algorithm that
incorporates our knowledge about speech and the background
noise in some optimum manner, the task remains to
evaluate the performance of the system and estimate the
implementation cost. In general, the performance improvement
of a speech enhancement system can only be shown by
an adequate evaluation. Many systems that have been
proposed for speech enhancement provide apparent improvement
in the S/N ratio, but on careful evaluation [4,5,6]
in fact reduce intelligibility. If the system proposed
is sufficiently complex such that the implementation cost
is too high relative to the system performance, then an
alternative procedure has to be considered. Under such a
circumstance, we may have to go back to the beginning
and redefine the goal of speech enhancement or reconsider
the types of knowledge of speech and the background noise
to be incorporated into a speech enhancement system.
Thus, developing a speech enhancement system under a
specific  objective  and  cost  constraints  requires  a 
repetitive  procedure  that  begins  from  a  clear  definition 
of  the  goal  of  speech  enhancement  and  ends  with  a  decision 
based  on  the  evaluation  of  the  system  performance  and 
estimation  of  the  implementation  cost,  but  probably  after 
some  iterations. 

The  problem  of  bandwidth  compression  of  noisy  speech 
is closely related to the speech enhancement problem.
For example, a successful speech enhancement system with
the reconstruction approach has the potential to be used
as a bandwidth compression system for noisy speech.
Alternatively, the noise reduction approach can be used as
a pre-processor for a bandwidth compression system. Consequently,
the approach to developing a bandwidth compression
system for noisy speech is essentially the same as
that for a speech enhancement system except for some
additional considerations such as coding the speech parameters,
the degree of bandwidth compression desired,
etc. In fact, assuming the same knowledge of speech and
the background noise, and using the same optimality criterion
for both a speech enhancement system and a bandwidth
compression system, we would expect that the speech
enhancement system would look very similar to the bandwidth
compression system. The only major difference would be
that for the speech enhancement system, the speech
could  be  generated  either  by  the  noise  removal  or 
reconstruction  approach  whereas  for  the  bandwidth 
compression  system,  speech  must  generally  be  reconstructed. 

The  problem  of  speech  enhancement  has  received 
a  great  deal  of  attention  in  recent  years  and  numerous 
systems  have  been  proposed  to  enhance  degraded  speech. 
Nevertheless, significant improvements in speech intelligibility
or quality in practical situations have not yet been
demonstrated  by  any  of  the  existing  systems.  Part  of 
the  problem  appears  to  be  that  the  approaches  taken  in 
developing  various  speech  enhancement  systems  capitalize 
very  little  on  our  knowledge  of  speech.  The  proposed 
systems  differ  primarily  in  how  the  small  amount  of 
knowledge  about  the  speech  incorporated  into  the  system 
is  exploited  and  how  the  resulting  speech  is  generated. 

It  will  become  clear  in  our  discussions  in  Chapter  II 
that  if  we  follow  the  same  approach  that  has  led  to  the 
various  existing  systems,  we  can  easily  generate  systems 
at  a  faster  rate  than  we  can  evaluate  their  performance 
or  even  implement  them.  Regardless  of  their  performances, 
if  we  develop  a  speech  enhancement  system  capitalizing 
more  fully  on  our  knowledge  of  speech  in  an  "optimal" 
manner  we  would  expect,  in  general,  a  better  performance. 

In  this  dissertation,  we  develop  systems  for  enhancement 
and  bandwidth  compression  of  noisy  speech  by  attempting 
to "optimally" incorporate a specific underlying speech
model. The objective of this dissertation is, of course,
to develop speech enhancement and bandwidth compression
systems that are potentially applicable to practical situations.
An equally important objective of this dissertation
is  to  suggest  the  direction  of  other  future  research 
efforts  by  illustrating  an  example  of  a  structured  and 
theoretical  approach  for  incorporating  more  of  what  we 
know  about  speech  to  develop  enhancement  and  bandwidth 
compression  systems  of  noisy  speech. 

1.2  Scope  of  Thesis 

In  this  dissertation,  various  speech  enhancement 
systems  proposed  in  the  literature  are  summarized  and 
related  to  each  other  in  a  more  common  framework.  Some  of 
the  speech  enhancement  systems  which  appeared  to  be 
promising  were  studied  more  carefully  and  were  evaluated 
in terms of their performance in improving speech
intelligibility.  As  an  attempt  to  optimally  incorporate 
more  of  what  we  know  about  speech  in  developing  systems 
for  enhancement  and  bandwidth  compression  of  noisy  speech, 
a  parameter  estimation  problem  is  formulated.  The 
parameter  estimation  problem  is  then  considered  for  both 
noise-free  and  noisy  speech.  For  noise-free  speech, 
different points of view such as the Maximum Likelihood
approach [7,8], the Maximum A Posteriori estimation approach,
and the Kalman filtering approach [9] are reviewed carefully
and  related  to  each  other  and  to  the  conventional  linear 
prediction  analysis.  For  noisy  speech,  the  parameter 
estimation  problem  is  shown  to  be  generally  non-linear. 
Therefore,  two  "suboptimal"  procedures  which  have  linear 
implementations  are  developed.  In  addition,  two  systems 
for  bandwidth  compression  and  enhancement  of  noisy  speech 
which are computationally simpler than the linear implementations
are developed by approximating the linear implementations.
As a preliminary investigation into the performance
of systems developed in this dissertation, a
small subset of the systems is implemented and applied
to  both  synthetic  and  real  speech  data.  An  objective  and 
informal  subjective  evaluation  indicate  that  the  implemented 
systems  perform  well  as  bandwidth  compression  and  speech 
enhancement  systems  at  various  S/N  ratios.  Finally,  a 
number  of  potential  areas  of  study  which  are  not  performed 
as  a  part  of  the  thesis  but  are  within  the  scope  of  the 
theoretical  results  obtained  in  the  thesis  are  summarized 
and  a  possible  direction  of  future  research  in  this  area 
is  suggested. 

1.3  Summary  of  Chapters 

In  Chapter  II,  various  existing  speech  enhancement 
systems  are  summarized  and  related  to  each  other  in  a  common 
framework.  In  Chapter  III,  we  discuss  a  specific  model  of 
speech and the Maximum A Posteriori (MAP) estimation
approach  taken  in  this  thesis  to  estimate  the  speech  model 
parameters.  In  Chapter  IV,  the  MAP  estimation  procedure 
for  noise-free  speech  is  discussed.  The  MAP  estimation 
procedure  under  different  assumptions  leads  to  different 
sets  of  equations  to  solve,  two  of  which  are  equivalent 
to the covariance and correlation methods of linear
prediction analysis. In Chapter V, we discuss the MAP
estimation  problem  for  speech  degraded  by  additive  random 
noise.  The  theoretical  results  in  this  chapter  will  lead 
to  two  algorithms  that  require  solving  sets  of  linear 
equations  in  an  iterative  manner  to  estimate  the  speech 
model  parameters  from  the  noisy  speech.  In  Chapter  VI, 
we  develop  two  systems  based  on  the  algorithms  developed 
in Chapter V. The two systems developed are approximations
of the two algorithms in Chapter V and are computationally
simpler than the two algorithms. In addition
to  the  two  systems,  we  develop  an  "ad-hoc"  system  primarily 
for  the  comparison  of  the  two  systems  developed  in  this 
thesis  with  other  speech  enhancement  systems  previously 
proposed.  In  Chapter  VII,  the  performance  of  the  three 
systems  developed  in  Chapter  VI  in  estimating  the  speech 
model  parameters  is  qualitatively  demonstrated  by  various 
examples  based  on  both  synthetic  and  real  speech  data.  In 
Chapter  VIII,  the  performance  of  the  three  systems  is 
discussed  in  greater  detail  and  quantitatively  based  on  the 
results of both the objective and subjective tests. The
objective tests are based on the synthetic data and an
objective criterion which reflects the perceptually
important aspects of the speech parameters. The subjective
tests are divided into two parts, one part corresponding
to the bandwidth compression of noisy speech and the
second part corresponding to speech enhancement. The
comparisons of various systems in terms of bandwidth compression
are based on the synthesized sentences from the speech
model parameters obtained by the developed systems. In
the case of speech enhancement, two cases are considered.
In the first case, speech is generated by the noise reduction
approach. In the second case, speech is generated by
a complete analysis/synthesis system. In all cases of
the  subjective  tests,  the  evaluation  is  informal  and  based 
on  a  few  sentences  spoken  by  both  male  and  female  speakers 
judged  by  listeners  with  no  or  some  previous  experience 
in  the  subjective  tests.  In  Chapter  IX,  we  suggest  a 
direction  and  some  potential  areas  of  future  research.  In 
Chapter  X,  we  conclude  the  thesis  by  summarizing  the  main 
results  of  this  dissertation. 
CHAPTER II SURVEY OF SPEECH ENHANCEMENT TECHNIQUES

II.1 Introduction

A  number  of  techniques  have  been  previously  proposed 
for  the  enhancement  of  noisy  speech.  The  purpose  of  this 
chapter is to summarize various speech enhancement techniques
in a common framework and relate them to the bandwidth
compression systems of noisy speech. In Section
II.2, various speech enhancement systems are summarized
and related to each other. In Section II.3, we summarize
the performance of some of the systems discussed in Section
II.2. Some of the results are based on an informal
listening or a formal speech intelligibility test conducted
in this research and some others are based on the studies
by other researchers. In Section II.4, we discuss various
bandwidth compression systems which are based on the
speech enhancement systems summarized in Section II.2. In
Section II.5, we discuss the motivation for a new approach
to the problem of speech enhancement and bandwidth
compression of noisy speech.

II.2 Speech Enhancement Techniques

II.2.1 Adaptive Comb Filtering Method
Comb filtering for speech enhancement is based on the
notion that voiced sounds are periodic with a period that
corresponds to the fundamental frequency. Since the interfering
signals in general have energy in the frequency
regions between the speech harmonics, a comb filtering
operation in principle can reduce noise while preserving
speech signals to the extent that information of the
fundamental frequency is available and periodicity of
speech is strictly preserved. Capitalizing on this knowledge,
a comb filtering operation that passes only the
harmonics of speech was first applied by Shields [10]
to enhance degraded speech. Frazier [11] later observed
that even with accurate fundamental frequency information
Shields' adaptive comb filtering method distorts speech
signals significantly due to the time varying nature of
speech sounds. To reduce some of this distortion, Frazier
suggested an adaptive comb filter [11] which adjusts
itself to variations in the fundamental frequency. A
further improvement on Frazier's algorithm in treating
the transition regions between voiced and unvoiced sounds
was made by Lim [5]. In Frazier's algorithm, when voiced
sounds near the transitions are processed, the adaptive
comb filter extends over the unvoiced sounds due to the
filter length, which causes some distortion. By setting
the filter coefficients that extend over unvoiced sounds
to zero, Lim [5] found that a better performance can be
obtained.
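
The basic comb filtering idea can be sketched in a few lines of
Python. This is only an illustration under simplifying assumptions:
the pitch period is taken to be known and constant over the record
(the adaptive filters of Frazier and Lim track its variation, which
is not attempted here), and the function name, the weights and the
test signal are all hypothetical rather than taken from the cited
systems.

```python
import numpy as np

def comb_filter(x, period, weights):
    """Weighted sum of copies of x shifted by multiples of the pitch period.

    weights has odd length 2K+1; weights[K] multiplies the unshifted signal.
    Samples shifted in from outside the record are treated as zero.
    """
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    half = len(weights) // 2
    y = np.zeros_like(x)
    for k, a in zip(range(-half, half + 1), weights):
        shift = k * period
        if abs(shift) >= len(x):
            continue                      # shifted copy lies entirely outside the record
        if shift >= 0:
            y[shift:] += a * x[:len(x) - shift]
        else:
            y[:shift] += a * x[-shift:]
    return y

# Illustrative data (not from the thesis): a strictly periodic signal plus white noise.
rng = np.random.default_rng(0)
period = 40
n = np.arange(800)
s = np.sin(2 * np.pi * n / period) + 0.5 * np.sin(2 * np.pi * 3 * n / period)
d = rng.normal(scale=0.5, size=n.size)
weights = [0.25, 0.5, 0.25]               # sums to one, so the harmonics pass with unit gain
y = comb_filter(s + d, period, weights)

interior = slice(period, -period)         # skip the edges, where shifted copies run off the record
noise_in = np.mean(d[interior] ** 2)
noise_out = np.mean((y - s)[interior] ** 2)
```

For a strictly periodic input the interior of the output equals the
input, while the white noise power is scaled by the sum of the
squared weights. Real voiced speech is only approximately periodic,
which is the source of the distortion discussed above.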

Comb  filtering  generally  requires  accurate  pitch 
information. Parsons [12] developed a system which is
similar to comb filtering but in which the pitch information
is not obtained separately but is built into the system. More
specifically, in an application to a competing speaker
environment, each of the local spectral peaks in a high
resolution short time Fourier transform of voiced sounds
is attributed either to the main speaker or to a competing
speaker. Then speech is generated based on the spectral
contents that correspond to the peaks of the main speaker.

Systems based on comb filtering have been evaluated
in this research and by other researchers and the results
are summarized in Section II.3.1.

II.2.2 Correlation Subtraction Method

The correlation subtraction method for speech enhancement
is based on the notion that if additive noise is
uncorrelated with the signal, then the correlation of the
signal equals the noise correlation subtracted from the
correlation of the observed signal. More specifically,
when a signal is degraded by additive background noise,
a noisy signal y(n) can be represented by

$$y(n) = s(n) + d(n) \tag{2-1}$$

in which s(n) and d(n) represent the signal and the background
noise (or disturbance) respectively. Multiplying
both sides of equation (2-1) by y(n-k) and taking the
expected value,

$$E[y(n)\,y(n-k)] = E[s(n)\,s(n-k)] + E[d(n)\,d(n-k)] + E[d(n)\,s(n-k)] + E[d(n-k)\,s(n)] \tag{2-2}$$

If s(n) is assumed to be uncorrelated with d(n) (and d(n)
has zero mean), the last two terms in equation (2-2)
disappear and thus

$$E[y(n)\,y(n-k)] = E[s(n)\,s(n-k)] + E[d(n)\,d(n-k)] \tag{2-3}$$

If s(n) and d(n) are assumed to be stationary so that the
expectation of the two functions depends only on their
time differences, equation (2-3) with a change of variables
can be written as

$$R_y(n) = R_s(n) + R_d(n) \tag{2-4}$$

in which $R_x(n)$ represents $E[x(l)\,x(l-n)]$, the correlation
of x(n). Fourier transforming equation (2-4) leads to

$$P_y(\omega) = P_s(\omega) + P_d(\omega) \tag{2-5}$$

in which $P_x(\omega)$ represents
$F[R_x(n)] = \sum_{n=-\infty}^{\infty} R_x(n)\,e^{-j\omega n}$, the
power spectrum of x(n). It is clear from equation (2-4)
that the subtraction of $R_d(n)$ from $R_y(n)$ leads to $R_s(n)$ and
thus the name "correlation subtraction" method.
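
The additivity in equation (2-4) can be checked numerically on
synthetic stationary data. The sketch below is an illustration only:
the first-order autoregressive "signal", the noise level, the record
length and the helper name are all assumptions, and the expectations
are replaced by time averages over one long record.

```python
import numpy as np

def sample_correlation(x, max_lag):
    """Time-average estimate of R_x(k) = E[x(l) x(l-k)] for k = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[k:], x[:n - k]) / (n - k) for k in range(max_lag + 1)])

rng = np.random.default_rng(1)
n_samples = 200_000

# Hypothetical stationary "signal": a first-order autoregressive process.
e = rng.normal(size=n_samples)
s = np.empty(n_samples)
s[0] = e[0]
for i in range(1, n_samples):
    s[i] = 0.9 * s[i - 1] + e[i]

d = rng.normal(scale=2.0, size=n_samples)   # zero mean white noise, independent of s
y = s + d

r_y = sample_correlation(y, 5)
r_s = sample_correlation(s, 5)
r_d = sample_correlation(d, 5)

# Sample estimate of the two cross terms in equation (2-2); small but not exactly zero.
residual = r_y - (r_s + r_d)
```

The residual shrinks as the record length grows, which is the sense
in which the cross terms "disappear" under the expectation.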

In the case of speech, the correlation function cannot
be expressed as $R_s(n)$ since speech cannot be considered
stationary. Thus we define the short time correlation of
speech $\phi_s(n)$ as

$$\phi_s(n) = \sum_{l=-\infty}^{\infty} s_w(l)\,s_w(l-n) \tag{2-6}$$

in which $s_w(l)$ represents the windowed speech waveform.
One important difference between $\phi_s(n)$ and $R_s(n)$ is
that $\phi_s(n)$ can be defined for non-stationary signals as
well as for stationary signals. Since
$y_w(n) = s_w(n) + d_w(n)$, multiplying both sides with
$y_w(n-k)$ and summing over all n leads to

$$\phi_s(n) = \phi_y(n) - \phi_d(n) - 2\phi_{sd}(n) \tag{2-7}$$

where

$$\phi_y(n) = \sum_{l=-\infty}^{\infty} y_w(l)\,y_w(l-n),$$

$$\phi_d(n) = \sum_{l=-\infty}^{\infty} d_w(l)\,d_w(l-n),$$

and

$$\phi_{sd}(n) = \sum_{l=-\infty}^{\infty} s_w(l)\,d_w(l-n).$$

Equation (2-7) is exact without any approximations.
We will find that a number of speech enhancement systems
summarized in this chapter differ primarily in how $\phi_s(n)$
is specifically estimated and how speech is generated
once $\phi_s(n)$ is estimated. We will also find that in various
speech enhancement systems, equation (2-7) is a starting
point for estimating $\phi_s(n)$ from y(n). Before we discuss
how $\phi_s(n)$ is specifically estimated in the correlation
subtraction method, it is worthwhile to note why it is
important to attempt to estimate $\phi_s(n)$ accurately. From
equation (2-6), $\phi_s(n)$ is related to $|S_w(\omega)|$, the magnitude
of the discrete time Fourier transform of $s_w(n)$, by

$$|S_w(\omega)|^2 = F[\phi_s(n)] = \sum_{n=-\infty}^{\infty} \phi_s(n)\,e^{-j\omega n} \tag{2-8}$$

Thus the attempt to estimate $\phi_s(n)$ more accurately is
equivalent to attempting to preserve the short time
spectral information of speech $|S_w(\omega)|$ which is known [13]
to be important for both the intelligibility and quality
of speech.
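
Equation (2-8) is easy to verify numerically for a finite windowed
segment. In the sketch below the segment is random data shaped by a
Hamming window, chosen only for illustration; the helper name and the
FFT length are assumptions. The FFT length is taken at least 2N-1 so
that the correlation, which has 2N-1 nonzero lags, is not aliased in
time.

```python
import numpy as np

def short_time_correlation(xw):
    """phi(n) = sum_l xw(l) xw(l-n) for lags n = -(N-1)..(N-1), as in equation (2-6)."""
    xw = np.asarray(xw, dtype=float)
    return np.correlate(xw, xw, mode="full")

rng = np.random.default_rng(2)
N = 64
sw = rng.normal(size=N) * np.hamming(N)     # stand-in for a windowed speech segment

phi = short_time_correlation(sw)            # index 0 holds lag -(N-1)
L = 2 * N                                   # FFT length >= 2N-1
phi_circ = np.zeros(L)
phi_circ[:N] = phi[N - 1:]                  # lags 0..N-1
phi_circ[L - (N - 1):] = phi[:N - 1]        # lags -(N-1)..-1 wrapped to the end

lhs = np.abs(np.fft.fft(sw, L)) ** 2        # |S_w(w)|^2 sampled at L frequencies
rhs = np.fft.fft(phi_circ).real             # F[phi_s(n)] at the same frequencies
```

The two arrays agree to within rounding error, so an estimate of the
short time correlation and an estimate of the short time power
spectrum carry the same information.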

In the correlation subtraction method, $\phi_s(n)$ is
estimated based on equation (2-7). From the windowed
noisy speech $y_w(n)$, $\phi_y(n)$ can be directly computed.
$\phi_d(n)$ and $\phi_{sd}(n)$ cannot be obtained exactly from y(n)
unless d(n) is exactly known, and in the correlation
subtraction method, $\phi_d(n)$ and $\phi_{sd}(n)$ are approximated
by $E[\phi_d(n)]$ and $E[\phi_{sd}(n)]$. For a zero mean d(n)
uncorrelated with s(n), $E[\phi_{sd}(n)]$ equals zero and
therefore equation (2-7) can be approximately written as

$$\phi_s(n) \approx \phi_y(n) - E[\phi_d(n)] \tag{2-9}$$

$E[\phi_d(n)]$ can be obtained either from the assumed known
statistics of d(n) or by an actual measurement from the
background noise in the intervals when speech is not
present. Fourier transforming equation (2-9),

$$|S_w(\omega)|^2 \approx |Y_w(\omega)|^2 - E[|D_w(\omega)|^2] \tag{2-10}$$

Based on equations (2-9) and (2-10), $\phi_s(n)$ and
$|S_w(\omega)|^2$ are estimated as

$$\hat{\phi}_s(n) = \phi_y(n) - E[\phi_d(n)] \tag{2-11a}$$

and

$$|\hat{S}_w(\omega)|^2 = |Y_w(\omega)|^2 - E[|D_w(\omega)|^2] \tag{2-11b}$$

From equation (2-11b), $|\hat{S}_w(\omega)|^2$ is not guaranteed to be
non-negative. This is because there is no built-in
mechanism in the above estimation procedure to force
$\hat{\phi}_s(n)$ to correspond to the short time correlation of
some real sequence. When such a situation does occur, a
number of different arbitrary steps may be taken. In some
studies, the negative values are made positive by changing
the sign. In some other studies $|\hat{S}_w(\omega)|^2$ is set to zero
if $|Y_w(\omega)|^2$ is less than $E[|D_w(\omega)|^2]$.

Given an estimate of $\phi_s(n)$ or $|S_w(\omega)|^2$, there are a
number of different ways to generate speech. One method
which is popular in the class of systems related to some
form of spectral subtraction is to approximate $\angle S_w(\omega)$,
the phase of $S_w(\omega)$, by $\angle Y_w(\omega)$ and then generate
$\hat{s}_w(n)$ or $\hat{S}_w(\omega)$ by

$$\hat{S}_w(\omega) = |\hat{S}_w(\omega)|\,e^{j\angle Y_w(\omega)} \tag{2-12a}$$

and

$$\hat{s}_w(n) = F^{-1}[\hat{S}_w(\omega)] \tag{2-12b}$$

A typical algorithm for speech enhancement by the correlation
subtraction method is shown in Figure 2.1. The
system in Figure 2.1 has been evaluated in this research
and the results are summarized in Section II.3.2.

Generating $\hat{s}_w(n)$ by equation (2-12) corresponds to
taking the noise reduction approach for speech enhancement.
As we discussed in Chapter I, it is also possible to take
the reconstruction approach, as we will see shortly.
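
The processing of a single windowed segment by equations (2-11b) and
(2-12) can be sketched as follows. This is a minimal illustration
rather than the implementation evaluated in this research: the
function name and the test data are hypothetical, negative values are
set to zero (one of the arbitrary choices mentioned above), and the
noise is assumed white with known variance, for which
$E[|D_w(\omega)|^2]$ equals the variance times the sum of the squared
window samples.

```python
import numpy as np

def correlation_subtraction_frame(yw, noise_power):
    """Estimate one windowed speech segment from a windowed noisy segment.

    yw          : windowed noisy segment y_w(n)
    noise_power : E[|D_w(w)|^2], a scalar or an array on the FFT grid of yw
                  (assumed known or measured during silence)
    """
    yw = np.asarray(yw, dtype=float)
    Yw = np.fft.fft(yw)
    power = np.abs(Yw) ** 2 - noise_power                   # (2-11b); may be negative
    power = np.maximum(power, 0.0)                          # ad hoc fix: clip at zero
    Sw_hat = np.sqrt(power) * np.exp(1j * np.angle(Yw))     # (2-12a): keep the noisy phase
    return np.fft.ifft(Sw_hat).real                         # (2-12b)

# Illustrative segment (not thesis data): a sinusoid in white noise.
rng = np.random.default_rng(3)
N = 256
window = np.hamming(N)
sigma = 0.3
s = np.sin(2 * np.pi * 0.05 * np.arange(N))
yw = window * (s + rng.normal(scale=sigma, size=N))
noise_power = sigma ** 2 * np.sum(window ** 2)              # E[|D_w(w)|^2] for white noise
sw_hat = correlation_subtraction_frame(yw, noise_power)
```

With `noise_power` set to zero the function returns its input
unchanged, and for any non-negative `noise_power` the output energy
cannot exceed the input energy, since every spectral magnitude is
either reduced or set to zero.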

[Figure 2.1  A typical speech enhancement system by the
correlation subtraction method. Input: windowed noisy speech
$y_w(n)$. Phase: $\angle\hat{S}_w(\omega) = \angle Y_w(\omega)$. Magnitude:
$|\hat{S}_w(\omega)| = \sqrt{|Y_w(\omega)|^2 - E[|D_w(\omega)|^2]}$ for
$|Y_w(\omega)|^2 > E[|D_w(\omega)|^2]$, and 0 otherwise. Output:
estimated windowed speech $\hat{s}_w(n)$.]

II.2.3 Speech Enhancement by a Voice Excited Vocoder
Magill and Un [14] developed a speech enhancement
system by a voice excited LPC vocoder when the background
noise is white. The system information, namely the
LPC coefficients, is obtained by the correlation method
of the linear prediction analysis in which the short time
correlation of speech is estimated by the correlation
subtraction method discussed in Section II.2.2. For
the source information, the noisy speech is low pass
filtered at 600 Hz and then non-linearly distorted to
broaden its bandwidth. This is based on the notion that
voiced speech generally decays at 6 dB/octave rate and
therefore the low frequency components are least degraded
by additive white noise. Speech is then generated based
on the estimated source and system information.

The system by Magill and Un is identical to the
correlation subtraction method in estimating $\phi_s(n)$
from y(n). The difference lies in how speech is generated
based on the estimated $\phi_s(n)$. The reconstruction approach
taken in this system has a disadvantage in that the
source information has to be obtained in some manner.
However, it has the advantage that the speech enhancement
system can be used not only as a pre-processor for various
bandwidth compression systems of noise-free speech, but
also as a bandwidth compression system itself. The performance
of the system by Magill and Un is not known.


II.2.4 INTEL System

Weiss, et al. [15] developed a speech enhancement
system called INTEL or "Intelligibility Enhancement by
Liftering". The INTEL system has several versions. One
early version is based on the notion that in the short
time correlation domain speech is in general more spread
from the origin than the background noise such as white
noise. Therefore some form of gating out (liftering)
the low time region of the short time correlation of
noisy speech may eliminate more noise components than
speech and thus may lead to some speech enhancement.
When a system based on this method was implemented by
Weiss, et al. [15] and also in this research, the performance
of the system was found to be rather poor.

Another version of the INTEL system which in a sense
is a generalization of the correlation subtraction method
has been studied in some detail in this research. The
INTEL system referred to from this point on corresponds to
this version of the INTEL system. In Section II.2.2,
it was shown that the correlation subtraction method
corresponds to estimating the short time correlation of
speech $\phi_s(n)$ by
$F^{-1}[|Y_w(\omega)|^2] - E[F^{-1}[|D_w(\omega)|^2]]$. Weiss,
et al. simply replaced the squaring operation with raising
to an arbitrary positive real power "a". In this method, then,
by defining $\phi'_s(n)$ to be $F^{-1}[|S_w(\omega)|^a]$, $\phi'_s(n)$ is
estimated by $F^{-1}[|Y_w(\omega)|^a] - E[F^{-1}[|D_w(\omega)|^a]]$.
Based on this estimate of $\phi'_s(n)$ and the assumption that
$\angle S_w(\omega)$ equals $\angle Y_w(\omega)$,
speech is generated. The speech enhancement system proposed
by Weiss, et al. is shown in Figure 2.2.

The algorithm in Figure 2.2 can be simplified both
computationally and conceptually by recognizing that the
expectation and Fourier transform operations are linear
and hence can be interchanged. Such a simplified system
is shown in Figure 2.3. The figure clearly shows that
the INTEL system is one way of estimating the short time
spectral magnitude of speech. In Figure 2.3, when the estimate of
$|S_w(\omega)|^a$ obtained is not positive, it is set to zero for
a reason similar to that discussed in Section II.2.2. The performance
of the INTEL system is summarized in Section II.3.2.
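
The simplified system of Figure 2.3 can be sketched for a single windowed frame as follows. This is only an illustrative sketch: the function names `dft`, `idft` and `intel_frame`, and the argument `noise_mag_a` (an assumed, externally supplied estimate of $E[|D_w(\omega)|^a]$ at each DFT bin), are hypothetical and do not come from Weiss, et al. A naive DFT is used so that the block is self-contained.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def intel_frame(y_frame, noise_mag_a, a=1.0):
    """Estimate one windowed speech frame from one windowed noisy frame.

    y_frame     : samples of the windowed noisy speech y_w(n)
    noise_mag_a : assumed estimate of E[|D_w(w)|^a] per DFT bin (hypothetical input)
    a           : the positive real constant "a" of the INTEL system
    """
    Y = dft(y_frame)
    S_hat = []
    for Yk, Dk in zip(Y, noise_mag_a):
        mag_a = max(abs(Yk) ** a - Dk, 0.0)   # non-positive estimates are set to zero
        mag = mag_a ** (1.0 / a)              # back from |S|^a to |S|
        S_hat.append(cmath.rect(mag, cmath.phase(Yk)))  # phase of noisy speech is kept
    # The result is real when noise_mag_a has the same symmetry as |Y|.
    return [v.real for v in idft(S_hat)]
```

With `noise_mag_a` set to zero in every bin the frame is returned unchanged, and with a=2 the subtraction corresponds to the correlation subtraction method of Section II.2.2.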


II.2.5  SABER Method

Boll [16] developed a speech enhancement system
called SABER or "Spectral Averaging for Bias Estimation
and Removal". In this method, $|S_w(\omega)|$ is estimated by
subtracting $E[|D_w(\omega)|]$ from a local average of $|Y_w(\omega)|$.
More specifically, it is assumed that

$$|\hat{S}_w(\omega)| \approx \frac{1}{M}\sum_{i=1}^{M} |Y_w(\omega)|_i - E[|D_w(\omega)|] \qquad \text{(2-13a)}$$

where $|Y_w(\omega)|_i$ represents $|Y_w(\omega)|$ obtained from the ith
segment of the noisy speech and M is the number of consecutive
windows used for local averaging.

[Figure: INTEL system proposed by Weiss et al.]

To relate the SABER method to the INTEL system, we
rewrite equation (2-13a) as follows:

$$|\hat{S}_w(\omega)| \approx \frac{1}{M}\sum_{i=1}^{M} \left\{ |Y_w(\omega)|_i - E[|D_w(\omega)|] \right\} \qquad \text{(2-13b)}$$

The term $|Y_w(\omega)|_i - E[|D_w(\omega)|]$ in equation (2-13b) is
how $|S_w(\omega)|$ is estimated by the INTEL system with a=1.
Therefore the SABER method is equivalent to estimating
$|S_w(\omega)|$ by a local average of the sets of $|\hat{S}_w(\omega)|$ obtained
by the INTEL system with a=1 if the same windows are used
in  both  cases.  In  fact,  in  the  implementation  of  the  INTEL 
system,  some  form  of  local  averaging  is  done  by  applying 
the  windows  that  are  overlapped  with  each  other  to  the 
input  noisy  speech  data.  In  this  context,  then,  the 
SABER  method  can  be  viewed  as  a  variation  of  a  special 
case  of  the  INTEL  system  shown  in  Figure  2.3.  The 
evaluation  results  of  the  SABER  method  reported  by  Boll 
are  summarized  in  Section  II. 3. 3. 
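
The equivalence between equations (2-13a) and (2-13b) can be checked numerically with a small sketch. The function names and the list-of-lists layout (one list of bin magnitudes per frame) are assumptions made for illustration only. Setting negative results to zero after the averaging is also an assumption carried over from the INTEL discussion; if each frame were clamped before averaging, the two forms would no longer agree exactly.

```python
def saber_magnitude(frame_mags, noise_mag_mean):
    """Equation (2-13a): local average of |Y_w| over M frames minus E[|D_w|]."""
    M = len(frame_mags)
    est = []
    for k in range(len(noise_mag_mean)):
        avg = sum(frame[k] for frame in frame_mags) / M
        est.append(max(avg - noise_mag_mean[k], 0.0))   # assumed clamping at zero
    return est

def averaged_intel_magnitude(frame_mags, noise_mag_mean):
    """Equation (2-13b): local average of the per-frame differences (INTEL, a=1)."""
    M = len(frame_mags)
    est = []
    for k in range(len(noise_mag_mean)):
        avg = sum(frame[k] - noise_mag_mean[k] for frame in frame_mags) / M
        est.append(max(avg, 0.0))                       # clamped after averaging
    return est

# Three consecutive frames (M = 3), three frequency bins each.
frames = [[1.0, 2.0, 0.5],
          [3.0, 2.0, 0.5],
          [2.0, 5.0, 0.5]]
noise = [1.0, 1.0, 1.0]
print(saber_magnitude(frames, noise))
print(averaged_intel_magnitude(frames, noise))
```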

In a more recent study [17], Boll reported that the
local  averaging  discussed  above  is  not  important  in  his 
system. 

II.2.6  Other Generalizations of Correlation
Subtraction Method

The INTEL system discussed in Section II.2.4 is in
a sense an arbitrary generalization of the correlation
subtraction method. An alternative arbitrary generalization
is to estimate $|S_w(\omega)|^2$ by $|Y_w(\omega)|^2 - k\cdot E[|D_w(\omega)|^2]$ for
some arbitrary constant k and based on this estimate of
$|S_w(\omega)|$, speech can be generated in the same manner
as in the correlation subtraction method. This system
was proposed [18] for possible speech enhancement and
studied in this research. The performance of this system
is summarized in Section II.3.4.

In a more recent study [19], Schwartz et al. considered
for speech enhancement the same system discussed above.
In their study, an additional feature is included in
that after the subtraction, $|\hat{S}_w(\omega)|^2$ obtained is compared
to a threshold level $\beta\cdot E[|D_w(\omega)|^2]$ for a small arbitrary
constant $\beta$ and if $|\hat{S}_w(\omega)|^2$ is smaller it is set to
$\beta\cdot E[|D_w(\omega)|^2]$. Thus in their system,

$$|\hat{S}_w(\omega)|^2 = \begin{cases} |Y_w(\omega)|^2 - k\cdot E[|D_w(\omega)|^2] & \text{for } |Y_w(\omega)|^2 > (k+\beta)\cdot E[|D_w(\omega)|^2] \\ \beta\cdot E[|D_w(\omega)|^2] & \text{otherwise} \end{cases}$$

Clearly, there exist a number of other arbitrary generalizations.
For example, we could estimate $|S_w(\omega)|^a$ by
$|Y_w(\omega)|^a - k\cdot E[|D_w(\omega)|^a]$ for some arbitrary constants
a and k. Such a system includes both the INTEL system
(by setting k=1) and the system discussed in this section
(by setting a=2) as special cases.
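
The family of estimators described in this section can be collected into one small function. This is a sketch under stated assumptions: the name `generalized_subtraction` and its arguments are hypothetical, `noise_mag_a` stands for an assumed estimate of $E[|D_w(\omega)|^a]$, and applying the threshold $\beta$ for values of a other than 2 is an extrapolation made here for illustration; the source describes the threshold only for a=2.

```python
def generalized_subtraction(y_mag, noise_mag_a, a=2.0, k=1.0, beta=0.0):
    """Estimate |S_w| per bin from |Y_w| and an estimate of E[|D_w|^a].

    a=2, k=1, beta=0 : correlation subtraction method
    k=1, beta=0      : INTEL system with exponent a
    a=2, beta>0      : system with the threshold described for Schwartz et al.
    """
    est = []
    for Y, D in zip(y_mag, noise_mag_a):
        diff = Y ** a - k * D        # subtraction in the |.|^a domain
        floor = beta * D             # threshold level; zero when beta=0
        # diff > floor is the same condition as |Y|^a > (k + beta) * E[|D|^a]
        est.append(max(diff, floor) ** (1.0 / a))
    return est

# Two bins: the first survives the subtraction, the second falls to the threshold.
print(generalized_subtraction([3.0, 1.0], [4.0, 4.0], a=2.0, k=1.0, beta=0.25))
```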
II.2.7  SPAC and SPOC

Suzuki  developed  [20]  a  speech  enhancement  system 
called  SPAC  or  "Splicing  of  Autocorrelation  Function". 

SPOC  or  "Splicing  of  Cross-correlation  Function"  is  a 
revised  version  [21]  of  SPAC.  The  two  systems  have  been 
used  for  compression  or  expansion  of  the  spectrum,  or 
lengthening  or  shortening  the  duration  of  speech,  or 
reducing  the  noise  level  in  the  speech  signal.  In  the 
discussions  in  this  section  only  the  noise  reduction 
aspect  is  considered. 

SPAC  is  based  on  the  notion  that  the  short  time 
correlation  of  speech  has  common  frequency  components 
with  the  short  time  speech.  Therefore,  for  voiced  sounds 
that  are  periodic  with  the  fundamental  frequency,  the 
short  time  correlation  properly  defined  is  also  periodic 
with  the  fundamental  frequency.  Furthermore,  if  one 
replaces  each  pitch  period  of  speech  with  the  corresponding 
pitch  period  of  the  short  time  correlation,  then  the 
frequency  components  of  speech  would  be  unchanged  except 
that  the  spectral  magnitude  at  each  frequency  would  be 
approximately squared. Since background noise such as
white noise generally degrades the points near the origin
in the short time correlation domain more than the other
points, speech may be enhanced by replacing each pitch
period of speech with one pitch period of the corresponding
short time correlation beginning some points away from the
origin. Suzuki observed that SPAC causes some distortions
due to the squaring operation of the spectral magnitude
of speech caused by replacing speech with its short time
correlation. SPOC is a revision of SPAC to reduce
such distortions.

To appreciate how this method compares to other
methods in terms of its performance, we consider a very
simple example. Suppose the background noise is zero mean
and white Gaussian with the variance of $\sigma_d^2$, and further
assume that s(n) is periodic with the period of T such
that s(n+T) = s(n) for all n. We define the short time
correlation of speech $\phi_s^*(n)$ at $n_0$ by

$$\phi_s^*(n) = \sum_{l=n_0}^{n_0+M-1} s(l)\cdot s(l-n)$$

for some fixed M, and $\phi_y^*(n)$ and $\phi_d^*(n)$ are similarly defined.
Note that $\phi_s^*(n)$ is slightly different from $\phi_s(n)$ in that
the summation is over M number of points independent of n.
Three cases are considered. In the first case, $\phi_s^*(n)$
is simply estimated as $\phi_y^*(n)$ and thus

$$\hat{\phi}_s^*(n) = \phi_y^*(n) \quad \text{for } 0 \le n \le T-1 \qquad \text{(2-14)}$$

In the second case, $\phi_s^*(n)$ is estimated by $\phi_y^*(n) - E[\phi_d^*(n)]$
and therefore

$$\hat{\phi}_s^*(n) = \phi_y^*(n) - M\cdot\sigma_d^2\cdot\delta(n) \quad \text{for } 0 \le n \le T-1 \qquad \text{(2-15)}$$

This case corresponds to the correlation subtraction method.
The third case corresponds to estimating $\phi_s^*(n)$ by SPAC
and therefore

$$\hat{\phi}_s^*(n) = \begin{cases} \phi_y^*(n+T) & \text{for } n = 0 \\ \phi_y^*(n) & \text{for } 1 \le n \le T-1 \end{cases} \qquad \text{(2-16)}$$

Comparing equations (2-14), (2-15) and (2-16), $\hat{\phi}_s^*(n)$
estimated is the same for $1 \le n \le T-1$ in all three cases.
Defining $e(0) = \hat{\phi}_s^*(0) - \phi_s^*(0)$, it can be easily shown
for case 1,

$$E[e(0)] = M\cdot\sigma_d^2$$

$$\mathrm{Var}[e(0)] = 4\cdot\sum_{l=n_0}^{n_0+M-1} s^2(l)\cdot\sigma_d^2 + 2M\cdot\sigma_d^4 \qquad \text{(2-17a)}$$

for case 2,

$$E[e(0)] = 0$$

$$\mathrm{Var}[e(0)] = 4\cdot\sum_{l=n_0}^{n_0+M-1} s^2(l)\cdot\sigma_d^2 + 2M\cdot\sigma_d^4 \qquad \text{(2-17b)}$$

and for case 3,

$$E[e(0)] = 0$$

$$\mathrm{Var}[e(0)] = 2\cdot\sum_{l=n_0}^{n_0+M-1} s^2(l)\cdot\sigma_d^2 + M\cdot\sigma_d^4 + k$$

in which $k \ll 2\sum_{l=n_0}^{n_0+M-1} s^2(l)\cdot\sigma_d^2$ and therefore

$$\mathrm{Var}[e(0)] \approx 2\cdot\sum_{l=n_0}^{n_0+M-1} s^2(l)\cdot\sigma_d^2 + M\cdot\sigma_d^4 \qquad \text{(2-17c)}$$

The  above  comparison  shows  that  the  correlation  subtraction 
method  eliminates  the  bias  but  does  not  reduce  the  error 
variance.  SPAC  eliminates  the  bias  and  reduces  the  error 
variance  by  about  50%. 
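
The bias and variance expressions (2-17a) to (2-17c) can be checked with a small Monte Carlo experiment. The sketch below is only illustrative: the sinusoidal s(n), the values T=20, M=20 and $\sigma_d$=0.5, the number of trials, and all function names are assumptions chosen for the demonstration. With M equal to T the noise samples in the two factors of the lag-T correlation do not overlap, so the small term k in case 3 is zero.

```python
import math
import random

def short_time_corr(x, n0, M, lag):
    """phi*(lag) = sum over l = n0 .. n0+M-1 of x(l) * x(l - lag)."""
    return sum(x[l] * x[l - lag] for l in range(n0, n0 + M))

def simulate_errors(T=20, M=20, sigma_d=0.5, trials=10000, seed=0):
    """Return {case: (sample mean, sample variance)} of e(0) for cases 1, 2, 3."""
    rng = random.Random(seed)
    n0 = T                                    # leaves room for the lag-T samples
    s = [math.sin(2.0 * math.pi * n / T) for n in range(n0 + M)]  # periodic speech
    phi_s0 = short_time_corr(s, n0, M, 0)
    errors = {1: [], 2: [], 3: []}
    for _ in range(trials):
        y = [v + rng.gauss(0.0, sigma_d) for v in s]              # noisy speech
        phi_y0 = short_time_corr(y, n0, M, 0)
        errors[1].append(phi_y0 - phi_s0)                         # case 1, (2-14)
        errors[2].append(phi_y0 - M * sigma_d ** 2 - phi_s0)      # case 2, (2-15)
        errors[3].append(short_time_corr(y, n0, M, T) - phi_s0)   # case 3, (2-16)
    stats = {}
    for case, e in errors.items():
        mean = sum(e) / len(e)
        var = sum((v - mean) ** 2 for v in e) / (len(e) - 1)
        stats[case] = (mean, var)
    return stats

def predicted(T=20, M=20, sigma_d=0.5):
    """Bias and variance from equations (2-17a), (2-17b) and (2-17c) with k = 0."""
    energy = sum(math.sin(2.0 * math.pi * l / T) ** 2 for l in range(T, T + M))
    v12 = 4.0 * energy * sigma_d ** 2 + 2.0 * M * sigma_d ** 4
    v3 = 2.0 * energy * sigma_d ** 2 + M * sigma_d ** 4
    return {1: (M * sigma_d ** 2, v12), 2: (0.0, v12), 3: (0.0, v3)}
```

Comparing `simulate_errors()` with `predicted()` gives a direct view of the bias removal in cases 2 and 3 and of the halving of the error variance in case 3.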

On  the  other  hand,  SPAC  requires  an  estimation  of  the 
fundamental  frequency  and  speech  is  not  strictly  periodic 
even  for  voiced  sounds.  Furthermore,  SPAC  can  not  be 
applied  to  unvoiced  sounds  and  even  with  the  revision  made 
by  SPOC,  there  are  some  spectral  degradations  due  to 
replacing  speech  with  the  short  time  correlation  type  of 
function.  The  performance  of  SPAC  or  SPOC  is  not  known. 

II.2.8  Wiener Filtering Method

If y(n) = s(n) + d(n) in which s(n) and d(n) are
samples obtained from uncorrelated stationary random
processes and if y(n) is available for all time, it is
well known [22] that the optimum linear estimator that
minimizes $E[(s(n) - \hat{s}(n))^2]$, in which $\hat{s}(n)$ represents
the estimate of s(n), is given by the non-causal Wiener
filter whose frequency response is given by

$$H(\omega) = \frac{P_s(\omega)}{P_s(\omega) + P_d(\omega)} \qquad \text{(2-18)}$$

where $P_x(\omega)$ represents the power spectrum of x(n).

Callahan [23] approximates the non-causal Wiener
filter in terms of the average short time energy spectrum
and thus

$$H(\omega) \approx \frac{E[\Phi_s(\omega)]}{E[\Phi_s(\omega)] + E[\Phi_d(\omega)]} \qquad \text{(2-19)}$$

in which $\Phi_s(\omega)$ and $\Phi_d(\omega)$ are given by $F[\phi_s(n)]$ and $F[\phi_d(n)]$.
$E[\Phi_d(\omega)]$ can be obtained either from the assumed known
statistics of d(n) or by averaging many frames of $\Phi_d(\omega)$
during which noise can be assumed to be stationary.
$E[\Phi_s(\omega)]$ is estimated by subtracting $E[\Phi_d(\omega)]$ from
locally averaged $\Phi_y(\omega)$ over many consecutive windows.
Callahan notes that to estimate $E[\Phi_y(\omega)]$ within an acceptable
variance, $\Phi_y(\omega)$ should be averaged over at least 100
msec, which is a relatively long interval during which
speech may not be assumed to be stationary. If $E[\Phi_s(\omega)]$
estimated is negative, it is set to zero. The short time
Fourier transform $S_w(\omega)$ is then estimated by multiplying
$Y_w(\omega)$ with $H(\omega)$ given in equation (2-19). Thus in this
system, $|S_w(\omega)|$ is estimated by $|Y_w(\omega)|\cdot\hat{H}(\omega)$ where
$\hat{H}(\omega)$ is obtained from equation (2-19) and $\angle S_w(\omega)$ is
assumed to be $\angle Y_w(\omega)$. In the specific algorithm by
Callahan, only one point of $\hat{s}_w(n)$ is obtained from the
estimated $\hat{S}_w(\omega)$ and the window slides through y(n) by
one point at a time. The performance of this system
reported by Callahan is summarized in Section II.3.5.
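
The gain computation described above can be sketched as follows. The function names and the per-bin list representation are hypothetical and are not taken from Callahan's algorithm; in particular, the sliding of the window by one point at a time is not reproduced here. The sketch only shows the local averaging, the subtraction with negative values set to zero, and the gain of equation (2-19).

```python
def local_average(spectra):
    """Average short time energy spectra over consecutive windows (one list per window)."""
    M = len(spectra)
    return [sum(frame[k] for frame in spectra) / M for k in range(len(spectra[0]))]

def wiener_gain(avg_phi_y, avg_phi_d):
    """Approximate H(w) of equation (2-19) for each frequency bin."""
    gains = []
    for Py, Pd in zip(avg_phi_y, avg_phi_d):
        Ps = max(Py - Pd, 0.0)          # negative estimates of E[Phi_s] are set to zero
        denom = Ps + Pd
        gains.append(Ps / denom if denom > 0.0 else 0.0)
    return gains

def apply_gain(y_mag, gains):
    """Estimate |S_w| as |Y_w| * H; the phase of Y_w would be reused unchanged."""
    return [Y * H for Y, H in zip(y_mag, gains)]

# Two windows of noisy-speech energy spectra, three bins, unit noise energy per bin.
phi_y_frames = [[5.0, 1.2, 0.4], [3.0, 0.8, 0.6]]
avg_phi_y = local_average(phi_y_frames)
print(wiener_gain(avg_phi_y, [1.0, 1.0, 1.0]))
```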

It appears that there are a number of other ways to
obtain $E[\Phi_y(\omega)]$ used in estimating $H(\omega)$ in equation
(2-19). Instead of averaging $\Phi_y(\omega)$ over 100 msec, an
equally reasonable way appears to be to perform some kind
of smoothing on $\Phi_y(\omega)$ and assume the smoothed $\Phi_y(\omega)$ to
be $E[\Phi_y(\omega)]$. Also, if we want to generalize the Wiener
filtering method arbitrarily as was done in the case of
the correlation subtraction method, there are, of course,
numerous possibilities.

II.2.9  Summary

In  this  section,  various  speech  enhancement  systems 
discussed  in  Section  II. 2  are  briefly  summarized.  The 
comb  filtering  method  is  an  attempt  to  increase  the  S/N 
ratio  based  on  the  periodicity  of  voiced  sounds.  SPAC 
or  SPOC  is  based  on  the  notion  that  in  the  correlation 
domain  the  effect  of  the  background  noise  is  typically 
more  pronounced  near  the  origin  while  speech  repeats  itself 


in each pitch period. In generating speech in SPAC or
SPOC, the notion that voiced sounds are periodic and that the
spectral contents of one period of speech are closely
related to one period of its correlation is exploited.

All other methods discussed in Section II.2 differ
primarily in how $\phi_s(n)$ or $|S_w(\omega)|$ is estimated and how
speech is generated based on $\hat{\phi}_s(n)$ or $|\hat{S}_w(\omega)|$. Their
differences are summarized in Table 2.1.

II.3  Summary of Performance Evaluation

II.3.1  Adaptive Comb Filtering Method

Speech enhancement techniques related to comb filtering
have been evaluated more extensively than other
techniques. Using Frazier's system [11], Perlmutter [4]
processed some speech material that consists of nonsense
sentences and performed intelligibility tests with interference
consisting of the speech of a competing talker.
Her results indicate that even with accurate fundamental
frequency information, the adaptive comb filtering method
decreases intelligibility at the S/N ratios where the
intelligibility of unprocessed nonsense sentences ranges
between 20 and 70%.

As  a  part  of  this  research,  Frazier's  adaptive  comb 
filtering  method  with  the  improvement  made  by  Lim  [5]  has 
been  evaluated  by  using  nonsense  sentences  as  test  materials 
when the interference is wide band random noise. Figure 2.4
shows the results of the intelligibility test as
a function of the S/N ratio and the length of the adaptive
comb filter. The results of the test show that even with
carefully hand edited pitch information, an adaptive comb
filtering method tends to decrease the speech intelligibility
at the S/N ratios where the intelligibility scores
of unprocessed nonsense sentences range between 20 and 70%.
Since  in  practice  accurate  pitch  information  is  not 
available  and  can  not  be  expected  to  be  obtained  from 
degraded  speech,  the  intelligibility  scores  will  be  even 
lower  than  shown  in  Figure  2.4. 

The evaluation results of the systems by Parsons
are not available. However, informal listening
indicates that the performance is similar to Frazier's
system when applied to a competing speaker environment.

II.3.2  Correlation Subtraction Method and INTEL
System

As we discussed in Section II.2, the INTEL system is
in a sense an arbitrary generalization of the correlation
subtraction method. More specifically, the case when a=2
in the INTEL system corresponds to the correlation subtraction
method. In this research, the performance of the
INTEL system in Figure 2.3 has been evaluated [6] by
using nonsense sentences as test materials when the
interference is wide band random noise. This study was
motivated primarily by the subjective impression that
substantial noise reduction was achieved by the INTEL
system. Figure 2.5 shows the results of the
intelligibility test as a function of the S/N ratio and
the constant "a". The results of the test show that the
system  does  not  increase  the  speech  intelligibility  at 
the  S/N  ratios  where  the  intelligibility  scores  of 
unprocessed  nonsense  sentences  range  between  20  and  70%. 
Based  on  our  informal  subjective  judgement,  however,  the 
processed  speech  by  the  INTEL  system  sounds  "less  noisy" 
and  of  higher  quality  at  relatively  high  S/N  ratios.  Thus 
if  the  system  is  evaluated  at  higher  S/N  ratios,  in  terms 
of  speech  quality  or  as  a  pre-processor  for  a  bandwidth 
compression  system,  then  the  system  may  be  found  to  be 
useful.  There  is  some  indication  that  the  above  may  be 
true,  as  will  be  discussed  in  the  next  section. 

II.3.3  SABER Method

Boll reported [17] the results of a very preliminary
evaluation of the SABER method, which corresponds to a=1
of the INTEL system. His results by the Diagnostic Rhyme
Test indicate that at the S/N ratio at which the intelligibility
score of the unprocessed speech material is
about 84%, the SABER method does not increase speech intelligibility,
which is consistent with our results of the
INTEL system with a=1. However, when speech quality is
tested [17] or the SABER method is used as a pre-processor
of a bandwidth compression system, some improvement is
noted at the above S/N ratio.

II.3.4  Other Generalizations of Correlation
Subtraction Method

Even though an extensive intelligibility test has not
been performed to evaluate the system discussed in
Section II.2.6 ($|\hat{S}_w(\omega)|^2 = |Y_w(\omega)|^2 - k\cdot E[|D_w(\omega)|^2]$),
based on an informal listening test it appears that the
performance of this system is similar to the INTEL system,
with a higher value of k generally corresponding to a
smaller value of a. For a wide range of S/N ratios (below
approximately 5 dB), a value of k less than 2 appears to
be better. A large value of k at low S/N ratios has the
effect of essentially eliminating the unvoiced sounds
and higher formants of voiced sounds. Further details
on the performance of this system will be discussed later
in this thesis.

The system by Schwartz et al., which has an additional
parameter $\beta$, is reported [19] to eliminate some perceptually
unpleasant speech degradation in the processing by a
proper choice of $\beta$.

II.3.5  Wiener Filtering Method

Callahan applied the Wiener filtering method discussed
in Section II.2.8 to reduce surface noise of a 1907
recording by Enrico Caruso and reported [23] that the
technique "greatly reduces" the surface noise. The performance
of the system when applied to enhance noisy
speech is not known.

II.4  Bandwidth Compression Systems of Noisy Speech

Our  discussions  in  Sections  II. 2  and  II. 3  have 
been  primarily  concerned  with  speech  enhancement  systems. 
However,  most  of  the  discussions  apply  equally  well  to 
the  bandwidth  compression  systems  of  noisy  speech,  since 
the  two  are  closely  related  to  each  other,  as  we  discussed 
in  Chapter  I.  A  successful  speech  enhancement  system  can 
in  general  be  used  as  a  part  of  a  bandwidth  compression 
system  of  noisy  speech.  This  point  is  obvious  for  a  class 
of  speech  enhancement  systems  based  on  an  analysis/synthesis 
system.  Alternatively,  a  successful  speech  enhancement 
system  can  potentially  be  used  as  a  pre-processor  for  a 
bandwidth  compression  system  of  noise-free  speech,  in  which 
case  we  can  represent  an  overall  bandwidth  compression 
system  of  noisy  speech  as  shown  in  Figure  2.6. 

In some cases, the system in Figure 2.6 can be
simplified. For example, a speech enhancement system such
as the correlation subtraction method is directed towards
estimating $|S_w(\omega)|$ more accurately. In a bandwidth compression
system such as an LPC vocoder [24,25], a homomorphic
vocoder [26] and a spectral root vocoder [27], $|\hat{S}_w(\omega)|$
can be directly used to obtain the speech synthesis parameters.
Then the system in Figure 2.6 can be simplified
to Figure 2.7. The main advantage of the system in
Figure 2.7 relative to the system in Figure 2.6 is the
computational simplicity in that the speech generation
process from $|\hat{S}_w(\omega)|$ in the speech enhancement system can
be avoided. A disadvantage is that an existing bandwidth
compression system of noise-free speech has to be modified.

[Figure 2.6  The analysis part of a bandwidth compression
system of noisy speech when a speech enhancement system
is used as a pre-processor. Blocks: Noisy Speech, Speech
Enhancement System, Enhanced Speech, Bandwidth Compression
System of Noise-free Speech, Speech Synthesis Parameters.]

[Figure 2.7  A possible simplification of the system in
Figure 2.6 for some cases. See the text for the details.
Blocks: Noisy Speech, Estimation of $|S_w(\omega)|$, Estimation of
Speech Synthesis Parameters, Speech Synthesis Parameters.]

From  the  above  discussions,  any  speech  enhancement 
system  discussed  in  Section  II. 2  may  be  used  in  one  form 
or  another  for  the  bandwidth  compression  of  noisy  speech. 
Little  data  exist  in  the  literature  on  the  performance 
evaluation  of  such  a  bandwidth  compression  system. 

II.5  Motivation for a New Approach

In  this  chapter,  we  have  summarized  various  speech 
enhancement  systems  previously  proposed.  Even  though  the 
list  of  the  speech  enhancement  systems  summarized  in 
Section  II. 2  is  not  complete,  they  illustrate  the  basic 
philosophy  behind  currently  available  speech  enhancement 
systems  and  raise  a  number  of  important  questions.  One 
question is in the incorporation of more knowledge of speech.
As we have seen in Section II.2, the speech enhancement
systems previously proposed are typically based on the periodicity
of voiced sounds, the uncorrelatedness of speech with
the background noise or the importance of the short time
spectral information for human speech perception.

A  natural  question  is  if  other  knowledge  of  speech  can  be 
incorporated  in  developing  speech  enhancement  systems. 

Another  question  is  in  how  we  incorporate  what  we  know 
about  speech.  As  we  discussed  in  Chapter  I,  it  is 
desirable  to  incorporate  our  knowledge  of  speech  in  a 
manner  consistent  with  the  goal  of  speech  enhancement. 

In the speech enhancement systems previously proposed, a
serious attempt has not been made to "optimally" incorporate
what we know about speech. A third question is on
developing a bandwidth compression system. In our discussions
of the bandwidth compression systems of noisy
speech in Section II.4, we have considered using the speech
enhancement  systems  as  pre-processors.  Such  a  system 
typically  requires  generating  enhanced  speech  and  then 
using  the  enhanced  speech  as  input  to  a  bandwidth  compression 
system  of  noise-free  speech.  A  natural  question  that 
arises  is  if  we  can  estimate  the  speech  synthesis  parameters 
directly  from  the  noisy  speech. 

In  this  dissertation,  we  develop  systems  for  enhancement 
and  bandwidth  compression  of  noisy  speech  by  attempting 
to  estimate  the  speech  synthesis  parameters  directly  from 
the  noisy  speech  based  on  a  well  known  estimation  procedure. 
Such an approach leads to the incorporation of more knowledge
of speech in an "optimum" manner. In the next chapter, we
discuss the basic approach taken in this thesis for
enhancement and bandwidth compression of noisy speech.
CHAPTER III  MODEL OF SPEECH AND ITS PARAMETER ESTIMATION

III.1  Introduction

Many successful speech processing systems rely at
least to some extent on a model of the speech as the
response of a quasi-stationary linear system to a pulse-like
excitation for voiced sounds or a noise-like excitation
for unvoiced sounds. To develop systems for enhancement
and bandwidth compression of noisy speech, it is
reasonable to capitalize on the underlying speech model.
Thus in this chapter, we formulate the problem of speech
enhancement and bandwidth compression of noisy speech as a
parameter estimation problem of the speech model parameters.
In Section III.2, we present the model of speech which has
been studied in great detail [7,13] and has been used
extensively [7,13] in many practical applications. In
Section III.3, we represent the speech model discussed in
Section III.2 in several different forms which we will find
useful in the later chapters. In Section III.4, we discuss
the model of noisy speech and its several different representations.
In Section III.5, we review briefly the theory
of the general parameter estimation problem and three
standard estimation rules that have been studied extensively
in the literature. In Section III.6, we discuss the estimation
of the speech model parameters and its relation to
the problem of enhancement and bandwidth compression of
noisy speech.

III.2  Model of Speech

A  digital  model  of  sampled  speech  that  has  been  used 
in  a  number  of  practical  applications  and  has  a  basis 
[7,13]  in  the  human  speech  production  system  is  shown 
in  Figure  3.1.  In  the  model,  the  excitation  source  is 
either  a  quasi-periodic  train  of  pulses  for  voiced  sounds 
or  random  noise  for  unvoiced  sounds.  The  digital  filter 
represents  the  effects  of  the  vocal  tract,  lip  radiation, 
and  in  addition  the  glottal  source  in  the  case  of  voiced 
sounds. Since the vocal tract changes in shape as a function
of time, the digital filter in Figure 3.1 is in general
time varying. However, over a short interval of time,
we may approximate the digital filter as a linear time
invariant system that can be represented as

$$H(z) = \begin{cases} G(z)\cdot V(z)\cdot R(z) & \text{for voiced sounds} \\ V(z)\cdot R(z) & \text{for unvoiced sounds} \end{cases}$$

where G(z), V(z) and R(z) represent the effects of the
glottal source, the vocal tract and the lip radiation,
respectively.

In general H(z) consists of both poles and zeroes.
However, for non-nasal voiced sounds, H(z) can be shown
[7] to be reasonably well modelled by an all pole system.
Furthermore, even for those cases such as nasal sounds or
unvoiced sounds in which H(z) can not adequately be
modelled [7] as an all pole system, experience [7,13]
has shown that speech analysis based on an all pole
system H(z) leads to many useful results and speech synthesized
based on the all pole model is highly intelligible
and of high quality. Since the analysis in general is
much simpler for an all pole system than a more general
system that includes zeroes as well as poles, H(z) will
be modelled as an all pole system. Thus in this thesis,
speech is modelled on the short time basis as the response
of a stationary all pole system to a pulse-like excitation
for voiced sounds or a noise-like excitation for unvoiced
sounds.

III.3  Representations of the Model of Speech

The model of speech discussed in Section III.2 can
be represented in many different forms. In this
section, we discuss four different representations of
the speech model.

In the speech model discussed in Section III.2,
the transfer function H(z) is modelled to be all-pole
of the form

$$H(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} \qquad \text{(3-1)}$$

Thus, on a short time basis the speech waveform s(n) is
assumed to satisfy a difference equation of the form

$$s(n) = \sum_{k=1}^{p} a_k\cdot s(n-k) + u(n) \qquad \text{(3-2)}$$

where u(n) is a pulse train or random noise. Notationally,¹
it is convenient to represent equation (3-2) in a matrix
form as

$$s(n) = \mathbf{a}^T\cdot\mathbf{s}(n-1,n-p) + u(n) \qquad \text{(3-3)}$$

where $\mathbf{a}$ is the parameter vector

$$\mathbf{a} = [a_1, a_2, \ldots, a_p]^T \qquad \text{(3-4)}$$

and $\mathbf{s}(n_1,n_2)$ denotes the vector of speech samples

$$\mathbf{s}(n_1,n_2) = [s(n_1), s(n_1-1), \ldots, s(n_2)]^T \qquad \text{(3-5)}$$

¹A summary of various notations used throughout the thesis
is in Appendix 1.

The vector of observations is assumed to consist of N
values s(N-1), s(N-2), ..., s(0), i.e., $\mathbf{s}(N-1,0)$,
which will be denoted by $\mathbf{s}_0$. Equation (3-3) for
$0 \le n \le N-1$ is one representation of the speech model.
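
The difference equation (3-2) can be exercised directly with a short sketch. The function name `synthesize` and the argument layout are assumptions made for illustration; the excitation here is a simple pulse train standing in for a voiced sound, and the coefficient values are arbitrary.

```python
def synthesize(a, u, initial=None):
    """Generate s(0), ..., s(N-1) from s(n) = sum_k a_k * s(n-k) + u(n).

    a       : [a_1, ..., a_p], the all-pole coefficients
    u       : excitation samples u(0), ..., u(N-1)
    initial : [s(-1), ..., s(-p)]; zeros when omitted (an assumption)
    """
    p = len(a)
    past = list(initial) if initial is not None else [0.0] * p  # past[k-1] holds s(n-k)
    s = []
    for un in u:
        sn = sum(ak * sk for ak, sk in zip(a, past)) + un
        s.append(sn)
        past = [sn] + past[:-1]       # shift the history by one sample
    return s

# Pulse train with a period of 8 samples driving a second order all-pole system.
pulses = [1.0 if n % 8 == 0 else 0.0 for n in range(24)]
voiced = synthesize([1.0, -0.5], pulses)
print(voiced[:4])
```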

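Equation (3-3) can be represented in various
different forms. One form comes from rewriting equation
(3-3) as

$$\mathbf{s}(N-1,0) = A\cdot\mathbf{s}(N-1,0) + A_I\cdot\mathbf{s}_I + \mathbf{u}(N-1,0) \qquad \text{(3-6a)}$$

where A is an NxN matrix and $A_I$ is an Nxp matrix whose
entries are the coefficients $a_k$ arranged so that row i of
equation (3-6a) reproduces equation (3-3) for n = N-i, i.e.,

$$[A]_{ij} = \begin{cases} a_{j-i} & \text{for } 1 \le j-i \le p \\ 0 & \text{otherwise} \end{cases} \qquad \text{(3-6b)}$$

$$[A_I]_{im} = \begin{cases} a_{N-i+m} & \text{for } N-i+m \le p \\ 0 & \text{otherwise} \end{cases} \qquad \text{(3-6c)}$$

and $\mathbf{s}_I$ is a px1 matrix given by

$$\mathbf{s}_I = [s(-1), s(-2), \ldots, s(-p)]^T \qquad \text{(3-6d)}$$

Therefore,

$$\mathbf{s}(N-1,0) = (I-A)^{-1}\cdot A_I\cdot\mathbf{s}_I + (I-A)^{-1}\cdot\mathbf{u}(N-1,0) \qquad \text{(3-6e)}$$

Equation (3-6) is another representation of the speech
model.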
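
A small numerical check can make the matrix representation concrete. The matrices A and $A_I$ are not legible in the source text, so the layout built below is an assumption: it is the arrangement implied by equation (3-3) when the vectors are ordered as s(N-1), ..., s(0) and s(-1), ..., s(-p). All function names are hypothetical. The sketch solves equation (3-6e) by back substitution and compares the result with the direct recursion.

```python
def recursion(a, u, s_init):
    """Direct use of s(n) = sum_k a_k s(n-k) + u(n); returns s(0..N-1) in time order."""
    past = list(s_init)               # [s(-1), ..., s(-p)]
    s = []
    for un in u:
        sn = sum(ak * sk for ak, sk in zip(a, past)) + un
        s.append(sn)
        past = [sn] + past[:-1]
    return s

def build_matrices(a, N):
    """Assumed layout: row i corresponds to time n = N-1-i (0-based rows)."""
    p = len(a)
    A = [[0.0] * N for _ in range(N)]
    A_I = [[0.0] * p for _ in range(N)]
    for i in range(N):
        n = N - 1 - i
        for k in range(1, p + 1):
            if n - k >= 0:
                A[i][i + k] = a[k - 1]        # multiplies s(n-k) inside s(N-1,0)
            else:
                A_I[i][k - n - 1] = a[k - 1]  # multiplies s(n-k), an initial condition
    return A, A_I

def solve_matrix_form(a, u, s_init):
    """Evaluate (I-A)^{-1} (A_I s_I + u) as in (3-6e); returns s in time order."""
    N = len(u)
    A, A_I = build_matrices(a, N)
    u_desc = list(reversed(u))                # u(N-1,0)
    b = [sum(r * v for r, v in zip(A_I[i], s_init)) + u_desc[i] for i in range(N)]
    x = [0.0] * N
    for i in range(N - 1, -1, -1):            # A is strictly upper triangular
        x[i] = b[i] + sum(A[i][j] * x[j] for j in range(i + 1, N))
    return list(reversed(x))
```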

Two other forms can be derived by representing equation
(3-3) in a state space form as shown in the following
equation:

$$\mathbf{x}(n) = F(n)\cdot\mathbf{x}(n-1) + G(n)\cdot\mathbf{u}(n)$$
$$\mathbf{z}(n) = H(n)\cdot\mathbf{x}(n) + \mathbf{v}(n) \quad \text{for } 0 \le n \le N-1 \qquad \text{(3-7)}$$

where $\mathbf{x}(n)$ is a state vector,
$\mathbf{z}(n)$ is an observation vector,
$\mathbf{u}(n)$ is an excitation vector,
$\mathbf{v}(n)$ is an observation noise vector,
$\mathbf{x}(-1)$ is an initial condition vector.

Equation (3-3) can be represented in the form of equation
(3-7) by using $\mathbf{a}$ as a state vector and thus

$$\mathbf{a}(n) = \mathbf{a}(n-1)$$
$$s(n) = \mathbf{s}^T(n-1,n-p)\cdot\mathbf{a}(n) + u(n) \quad \text{for } 0 \le n \le N-1 \qquad \text{(3-8)}$$

Alternatively, $\mathbf{s}(n,n-p+1)$ can be used as a state
vector $\mathbf{x}(n)$ and thus

$$\mathbf{x}(n) = F\cdot\mathbf{x}(n-1) + G\cdot u(n)$$
$$s(n) = H\cdot\mathbf{x}(n) \quad \text{for } 0 \le n \le N-1 \qquad \text{(3-9a)}$$


in  our  later  discussions. 
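
The state space form with $\mathbf{s}(n,n-p+1)$ as the state vector can be illustrated with a short sketch. The definitions of F, G and H are not legible in the source text, so the standard companion form used below is an assumption: F carries the coefficients $a_k$ in its first row and shifts the remaining states, while G and H select the first state. The function names are hypothetical.

```python
def companion(a):
    """Assumed companion-form F, G, H for the state x(n) = [s(n), ..., s(n-p+1)]."""
    p = len(a)
    F = [list(a)] + [[1.0 if j == i else 0.0 for j in range(p)] for i in range(p - 1)]
    G = [1.0] + [0.0] * (p - 1)
    H = [1.0] + [0.0] * (p - 1)
    return F, G, H

def run_state_space(a, u, x_init):
    """Iterate x(n) = F x(n-1) + G u(n), s(n) = H x(n), starting from x(-1) = x_init."""
    F, G, H = companion(a)
    x = list(x_init)                  # x(-1) = [s(-1), ..., s(-p)]
    out = []
    for un in u:
        x = [sum(f * xj for f, xj in zip(row, x)) + g * un for row, g in zip(F, G)]
        out.append(sum(h * xi for h, xi in zip(H, x)))
    return out

print(run_state_space([1.0, -0.5], [1.0, 0.0, 0.0], [0.0, 0.0]))
```

Under this assumed layout the output sequence is the same as the one produced by the difference equation (3-2) with the same coefficients, excitation and initial conditions.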

III.4  Model of Noisy Speech and its Representations

When  the  background  noise  is  added  to  speech,  the 
noisy  speech  can  be  represented  as 

y(n)  =  s(n)  +  d(n)  (3-10a) 

where y(n) represents noisy speech and d(n) represents
the background noise or disturbance. The observation
vector $\mathbf{y}(N-1,0)$, which will alternatively be denoted as
$\mathbf{y}_0$, then, consists of the sum of speech and background
noise, i.e.,

$$\mathbf{y}(N-1,0) = \mathbf{s}(N-1,0) + \mathbf{d}(N-1,0) \qquad \text{(3-10b)}$$

Combining equations (3-3) and (3-10),

$$y(n) = \mathbf{a}^T\cdot\mathbf{y}(n-1,n-p) - \mathbf{a}^T\cdot\mathbf{d}(n-1,n-p) + u(n) + d(n) \quad \text{for } 0 \le n \le N-1 \qquad \text{(3-11)}$$

Like equation (3-3), equation (3-11) can alternatively
be represented in various different forms. Two convenient
representations which parallel equations (3-6e) and (3-9a)
are

$$\mathbf{y}(N-1,0) = (I-A)^{-1}\cdot A_I\cdot\mathbf{s}_I + (I-A)^{-1}\cdot\mathbf{u}(N-1,0) + \mathbf{d}(N-1,0) \qquad \text{(3-12)}$$

where A, $A_I$ and $\mathbf{s}_I$ are given by equation (3-6) and

$$\mathbf{x}(n) = F\cdot\mathbf{x}(n-1) + G\cdot u(n)$$
$$y(n) = H\cdot\mathbf{x}(n) + d(n) \quad \text{for } 0 \le n \le N-1 \qquad \text{(3-13)}$$

where $\mathbf{x}(n)$, F, G and H are given by equation (3-9).
Equation (3-11), (3-12) or (3-13) represents the model
of noisy speech and will be found to be useful in the
later discussions.

III.5 Review of Parameter Estimation Theory

In this section, we review very briefly the general
parameter estimation theory. Let A and R denote the
parameter space and the observation space, and suppose
that there is a probabilistic mapping between the parameter
space and the observation space. Assume that a
point a in the parameter space was mapped to a point r
in the observation space. The parameter estimation problem
is to estimate the value of a after observing r by some
estimation rule.

Three different estimation rules known as Maximum
Likelihood (ML), Maximum A Posteriori (MAP) and Minimum
Mean Square Error (MMSE) estimation have many desirable
properties and thus have been studied extensively [22,28]
in the literature. For non-random parameters, the ML
estimation rule is often used. In ML estimation, the
parameter value is chosen such that the chosen value most
likely resulted in the observation r. Thus, the value of a is
chosen such that p_{R|A}(r|a), the probability density function
of R conditioned on A, is maximized at the observed r and the
chosen value of a. The MAP and MMSE estimation rules are
commonly used for parameters that can be considered as
random variables whose a priori density function is known.

In the MAP estimation rule, the parameter value is chosen
such that the a posteriori density p_{A|R}(a|r) is maximized at
the observed r and the chosen value of a. Even though the MAP
estimation rule is based on a random parameter assumption and
the ML estimation rule is based on a non-random parameter
assumption, the two estimation rules lead to identical estimates
of the parameter value when the a priori density of the
parameter in the MAP estimation rule is assumed to be flat over
the parameter space. For this reason, the ML estimation rule
is often viewed as a special case of the MAP estimation rule.

In the MMSE estimation rule, â(R), the estimate of a, is
obtained by minimizing the mean square error E[(â(R)-a)^2]. The
MMSE estimate of a is given by E[a|r], the a posteriori mean
of a given r. Therefore, when the maximum of the a posteriori
density function p_{A|R}(a|r) coincides with its mean, the MAP
estimation and MMSE estimation rules lead to identical
estimates.
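The relations among the three rules can be illustrated with a scalar example. The sketch below assumes a model that is not part of the text: a single observation r = a + noise, with Gaussian noise of known variance and a Gaussian a priori density for a. All function names are hypothetical. For this model the a posteriori density is Gaussian, so the MAP and MMSE estimates coincide, and the MAP estimate approaches the ML estimate as the a priori variance grows.

```python
import math

def ml_estimate(r):
    """ML estimate for r = a + noise: p(r|a) peaks at a = r."""
    return r

def map_estimate(r, prior_mean, prior_var, noise_var):
    """MAP estimate for a Gaussian a priori density and Gaussian noise.

    The a posteriori density is Gaussian, so its maximum equals its mean.
    """
    w = prior_var / (prior_var + noise_var)
    return w * r + (1.0 - w) * prior_mean

def mmse_estimate_numeric(r, prior_mean, prior_var, noise_var,
                          half_width=20.0, steps=40001):
    """MMSE estimate E[a|r], computed by brute-force numerical integration."""
    lo = prior_mean - half_width
    step = 2.0 * half_width / (steps - 1)
    num = 0.0
    den = 0.0
    for k in range(steps):
        a = lo + k * step
        # Unnormalized a posteriori density p(r|a) * p(a)
        weight = math.exp(-(r - a) ** 2 / (2.0 * noise_var)
                          - (a - prior_mean) ** 2 / (2.0 * prior_var))
        num += a * weight
        den += weight
    return num / den

r, prior_mean, prior_var, noise_var = 2.0, 0.0, 1.0, 0.5
a_ml = ml_estimate(r)
a_map = map_estimate(r, prior_mean, prior_var, noise_var)
a_mmse = mmse_estimate_numeric(r, prior_mean, prior_var, noise_var)
# A very large a priori variance stands in for a flat a priori density.
a_map_flat = map_estimate(r, prior_mean, 1e12, noise_var)
```

With an informative a priori density the MAP estimate is pulled from r toward the a priori mean; with a nearly flat one it is indistinguishable from the ML estimate.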

The  three  estimation  procedures  briefly  discussed 
above  have  been  applied  [22,28]  to  a  number  of  practical 


parameter  estimation  problems.  Detailed  discussions  on 
their  properties,  relations  and  application  areas  can 
be  found  in  [22,28]. 

III.6 Estimation of Speech Model Parameters

The model of speech discussed in Section III.2 is
completely specified if we determine the parameters related
to the excitation u(n) and the system parameters a in
H(z) of equation (3-1). The basic problem that has been
considered in this dissertation is the estimation of
the all pole coefficients a.

Ideally, the all pole coefficients should be
estimated based on a rule consistent with the subjective
aspects of speech. Since a function of a that relates
the degree of speech degradation in the subjective domain
is not well understood, developing such an estimation rule
is difficult. However, we may attempt to use other well
known estimation rules discussed in Section III.5 which
are optimum in a different sense but which have been
successfully applied to a number of other practical
problems. In this dissertation, we take the approach of
using the MAP estimation procedure. The parameter to be
estimated is a and the observation is the noisy speech.

The MAP estimation procedure is based on the philosophy
of maximizing p(a|y_0), where a and y_0 represent the
all pole coefficient vector and the noisy speech vector.
The approach of using the MAP estimation procedure to estimate
the all pole coefficients has a number of advantages.

First, the procedure and properties of MAP estimation
are well established [22] and can be applied to speech
processing. Second, the Maximum Likelihood (ML)
estimation procedure can be viewed as a special case of
the MAP estimation procedure since the two estimates are
the same when the a priori density of a is assumed to be
flat. One property of ML estimation which is useful
for speech processing is that if f(a) has a one to one
correspondence with a, then f̂_ML(a) = f(â_ML), where â_ML
represents the ML estimate of a. Therefore, if the perceptually
important parameters have a one to one correspondence
with a, then the ML estimates for such perceptually important
parameters are automatically obtained by obtaining
â_ML. Further, as will be discussed in greater detail in
Chapter IV, for noise-free speech â_ML under appropriate
assumptions is equivalent to the a obtained by the covariance
[1,29] or correlation [7,29] method, both of which
have been successfully applied to the Linear Predictive
Coding of speech. Third, the MAP estimation procedure
provides a theoretical framework in which some a priori
information about a can be incorporated. Due to the
temporal and spectral characteristics of speech, some a
priori information of the all pole coefficients a, when
properly incorporated, may in fact aid in estimating a.

In estimating a by the MAP estimation procedure, the
excitation u(n) is assumed to be zero-mean white Gaussian
noise. In the context of the speech model discussed in
Section III.2, this assumption is valid only for unvoiced
speech since the excitation is assumed to be random noise.
There are several reasons behind this particular choice of
the excitation. First, the analysis of the MAP estimation
procedure is relatively simple in the case of the random
noise excitation if the excitation is assumed to be
generated by a white Gaussian process. The case when the
excitation is a pulse train is considerably more difficult.
Second, as will be discussed in Chapter IV, in the absence
of background noise with the excitation treated as random,
one set of the MAP estimation procedures corresponds
exactly to the linear prediction analysis, which is well
known to be successful for both voiced and unvoiced
speech. Further, as will be discussed in Chapters VII
and VIII, the theoretical results developed in the thesis
for the system parameter estimation in the presence of
background noise when the excitation is random noise can
be applied with similar performance to the case of the
pulse train excitation.

If the all pole coefficients can be "better" estimated
through the MAP estimation procedure by accounting for the
presence of noise, then we in fact have a better bandwidth
compression system for noisy speech in the context of an LPC
vocoder. Even though a complete vocoding system requires
the estimation of the source parameters as well as the
system parameters, the problem of estimating the source
parameters accounting for the presence of background
noise will not be treated in this thesis. For the
enhancement of noisy speech, there are two ways that
the estimation of a can lead to speech enhancement. If
we in fact have a successful bandwidth compression
system, then the bandwidth compression system itself can
be used as a speech enhancer. Alternatively, in the
systems that we develop for the estimation of the all
pole coefficients, the speech s(n) is estimated in the
process of estimating the all pole coefficients. Thus if
speech enhancement is desired, then the estimated ŝ(n)
can be used as the enhanced speech. The fact that s(n)
is also estimated is important not only in the context of
speech enhancement, but also in the context of bandwidth
compression of noisy speech. If we estimate only the all pole
coefficients, then we are limited to a class of vocoding
systems known as LPC vocoders. Since speech is estimated
as well as the all pole coefficients, the systems developed
can also be used as pre-processors for any vocoding system.
Therefore, the systems developed in this thesis are
potentially applicable to both bandwidth compression
through a variety of vocoding systems and speech
enhancement of noisy speech.

CHAPTER IV STATISTICAL PARAMETER ESTIMATION FROM NOISE-FREE SPEECH

IV.1 Introduction

In this chapter, we review and relate various ways
of estimating the speech model parameters from
noise-free speech. In Section IV.2, the problem of parameter
estimation from noise-free speech is formulated.
Sections IV.3 and IV.4 discuss two different approaches
to the parameter estimation problem formulated
in Section IV.2.

IV.2 Problem Formulation

Speech is modelled as the response of a linear
quasi-stationary system to a noise-like excitation.
From equation (3-3) with u(n) corresponding to white
Gaussian noise,

$$s(n) = \mathbf{a}^T\mathbf{s}(n-1,n-p) + g\,w(n) \tag{4-1}$$

where w(n) is white Gaussian noise with zero mean and
unit variance (i.e., E[w(n)] = 0 and E[w(n)w(m)] = δ(n-m)).

Equation (4-1) implies that s(n) depends on a total
of 2p+1 parameters, specifically the p values in the
coefficient vector a, the initial conditions s_I = s(-1,-p),
and the gain factor g. We assume that these unknown
parameters are random with associated a priori Gaussian
probability densities. The basic problem treated in this thesis
is to estimate the system parameters a from the observation
vector s_0 by the MAP estimation procedure. Thus
the system parameters a are chosen to maximize p(a|s_0),²
the probability density function of a conditioned on s_0.
There are several approaches that can be taken in maximizing
p(a|s_0). In Sections IV.3 and IV.4, we consider two
different approaches.


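IV.3 Direct Approach: Maximization of p(a|s_0)

p(a|s_0) can be written as

$$p(\mathbf{a}|\mathbf{s}_0) = \int\!\!\int p(\mathbf{a},g,\mathbf{s}_I|\mathbf{s}_0)\,dg\,d\mathbf{s}_I \tag{4-2}$$

where the integration is over g and s_I. From Bayes' rule,
p(a,g,s_I|s_0) is given by:

$$p(\mathbf{a},g,\mathbf{s}_I|\mathbf{s}_0) = \frac{p(\mathbf{s}_0|\mathbf{a},g,\mathbf{s}_I)\,p(\mathbf{a},g,\mathbf{s}_I)}{p(\mathbf{s}_0)} \tag{4-3}$$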


The conditional density function p(s_0|a,g,s_I) can be
evaluated by noting that

$$p(\mathbf{s}_0|\mathbf{a},g,\mathbf{s}_I) = p(\mathbf{s}(N-1,0)\,|\,\mathbf{a},g,\mathbf{s}(-1,-p))$$
$$= \prod_{n=0}^{N-1} p(s(n)\,|\,\mathbf{a},g,\mathbf{s}(n-1,-p))$$
$$= \prod_{n=0}^{N-1} p(s(n)\,|\,\mathbf{a},g,\mathbf{s}(n-1,n-p)) \tag{4-4}$$

² For a more accurate representation, a probability density
function p_x(·) and the density function evaluated at x = x_0
should be distinguished. For notational convenience,
p(x_0) will be used in both cases and the distinction will
be left to the context in which it is used.


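From the model of equation (4-1) and the assumption that
w(n) is white Gaussian noise with unit variance,

$$p(s(n)\,|\,\mathbf{a},g,\mathbf{s}(n-1,n-p)) = \frac{1}{(2\pi g^2)^{1/2}} \exp\Bigl[-\frac{1}{2g^2}\bigl(s(n)-\mathbf{a}^T\mathbf{s}(n-1,n-p)\bigr)^2\Bigr] \tag{4-5}$$

From equations (4-4) and (4-5),

$$p(\mathbf{s}_0|\mathbf{a},g,\mathbf{s}_I) = \frac{1}{(2\pi g^2)^{N/2}} \exp\Bigl[-\frac{1}{2g^2}\sum_{n=0}^{N-1}\bigl(s(n)-\mathbf{a}^T\mathbf{s}(n-1,n-p)\bigr)^2\Bigr] \tag{4-6}$$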


p(a,g,s_I) in equation (4-3) represents the a priori knowledge
of the three unknown parameters. For a general Gaussian
density of p(a,g,s_I), it can be shown that maximizing
p(a|s_0) given by equations (4-2), (4-3), (4-4) and (4-6)
in general requires solving a set of non-linear equations.³

³ Consider a special case in which g is known, s_0 = [s(0)]
and p = 1. For a Gaussian density of p(a_1,s_I), p(a_1|s(0)) is
in the form of

[expression illegible in the source]

where k_1, k_2, k_3 and k_4 are constants. Maximizing
p(a_1|s(0)) in the above highly simplified case involves
solving a non-linear equation.

The problem can be made linear, however, by making
some specific assumptions about p(a,g,s_I) and/or including
as parameters for estimation auxiliary parameters
such as g and s_I, which are unwanted in the sense that our
primary interest is in estimating a. In the remainder of
this section, four⁴ such cases are examined. In case 1,
all of the parameters a, g and s_I are jointly estimated
assuming no a priori information of the parameters. The
estimate for a that results corresponds exactly to the
covariance method of the linear prediction analysis.
In case 2, s_I is assumed to be known and a and g are
estimated jointly assuming no a priori information of a and g.
Depending on specifically how s_I is assumed known, this
corresponds to estimating a using either the covariance
method or the correlation method of the linear prediction
analysis. In case 3, g is assumed to be known and a
and s_I are jointly estimated assuming no a priori
information of s_I. In case 4, only a is estimated assuming g
and s_I are known.

IV.3.1 Case 1

In this case, p(a,g,s_I|s_0) is maximized with respect
to a, g and s_I with the assumption that no a priori
information of a, g or s_I is available. This corresponds
to the case when p(a,g,s_I) is constant⁵. From equation
(4-3), since p(s_0) is not a function of a, g, or s_I and
p(a,g,s_I) is assumed to be constant, maximizing
p(a,g,s_I|s_0) is equivalent to maximizing p(s_0|a,g,s_I).
Thus, the MAP estimation of a, g and s_I in the absence of
a priori information reduces to the ML estimation of
those parameters.

⁴ These are the only four cases in which the solution can
be obtained by solving a set of linear equations.

From equation (4-6), maximizing p(s_0|a,g,s_I) with
respect to g leads to

$$g^2 = \frac{1}{N}\sum_{n=0}^{N-1}\bigl(s(n)-\mathbf{a}^T\mathbf{s}(n-1,n-p)\bigr)^2 \tag{4-7}$$
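Equation (4-7) can be checked against the likelihood of equation (4-6) directly. The sketch below is illustrative only: the data values, the coefficient a and the function names are hypothetical. It evaluates the logarithm of (4-6) for fixed a and s_I and confirms that the gain of (4-7) gives a larger value than nearby gains.

```python
import math

def residual_energy(s, s_init, a):
    """Sum over n = 0..N-1 of (s(n) - a^T s(n-1,n-p))^2.

    s_init = [s(-1), ..., s(-p)] is the initial condition vector s_I.
    """
    p = len(a)
    # ext[k] holds s(k - p), so that ext[p + n] = s(n)
    ext = list(reversed(s_init)) + list(s)
    total = 0.0
    for n in range(len(s)):
        pred = sum(a[i] * ext[p + n - 1 - i] for i in range(p))
        total += (ext[p + n] - pred) ** 2
    return total

def log_likelihood(s, s_init, a, g2):
    """Logarithm of p(s_0 | a, g, s_I) in equation (4-6); g2 stands for g^2."""
    n_obs = len(s)
    return (-0.5 * n_obs * math.log(2.0 * math.pi * g2)
            - residual_energy(s, s_init, a) / (2.0 * g2))

def gain_estimate(s, s_init, a):
    """Equation (4-7): g^2 equals the residual energy divided by N."""
    return residual_energy(s, s_init, a) / len(s)

# Hypothetical data for a first-order model (p = 1)
s_init = [0.2]
s = [0.5, 0.1, -0.3, 0.4, 0.6, -0.2]
a = [0.9]
g2_hat = gain_estimate(s, s_init, a)
best = log_likelihood(s, s_init, a, g2_hat)
lower = log_likelihood(s, s_init, a, 0.8 * g2_hat)
higher = log_likelihood(s, s_init, a, 1.25 * g2_hat)
```

Because the logarithm is monotonic, maximizing the log likelihood is equivalent to maximizing (4-6) itself.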


Maximization of p(s_0|a,g,s_I) with respect to a and s_I
is equivalent to minimizing ε_p given by

$$\epsilon_p = \frac{1}{g^2}\sum_{n=0}^{N-1}\bigl(s(n)-\mathbf{a}^T\mathbf{s}(n-1,n-p)\bigr)^2 \tag{4-8}$$


Thus we choose the parameters a and s_I to satisfy the set
of equations

$$\frac{\partial \epsilon_p}{\partial a_i} = 0 \quad \text{for } i = 1,2,\ldots,p \tag{4-9a}$$
$$\frac{\partial \epsilon_p}{\partial s(-j)} = 0 \quad \text{for } j = 1,2,\ldots,p \tag{4-9b}$$

⁵ As the variance becomes larger, the density function becomes
wider and flatter, approaching a constant. More formally,
however, it should be assumed that p(a,g,s_I) is Gaussian
with a covariance that approaches an arbitrarily large value.
In all the cases in this thesis where we assume that no
a priori information of some parameters can be modelled by
a uniform density of the parameters, it can be shown that
the same theoretical results are obtained by first solving
the case of finite variance and then letting the variance
approach infinity.

Rewriting equation (4-8) as

$$\epsilon_p = \frac{1}{g^2}\sum_{n=0}^{p-1}\bigl(s(n)-\mathbf{a}^T\mathbf{s}(n-1,n-p)\bigr)^2 + \frac{1}{g^2}\sum_{n=p}^{N-1}\bigl(s(n)-\mathbf{a}^T\mathbf{s}(n-1,n-p)\bigr)^2, \tag{4-10}$$

only the first of these summations involves the initial
condition vector s_I. It is straightforward to show
algebraically that for any non-zero solution of the
parameter vector a, s_I can be chosen so that the first
summation in equation (4-10) is zero. Since these are the
values which minimize ε_p with respect to s_I, they would
then correspond to the estimate of these parameters. Since
we are only interested in explicitly estimating the
coefficient vector a, it is not necessary to solve for s_I.
Since the first term in equation (4-10) will always be
zero when ε_p is minimized, the minimization of equation
(4-10) corresponds to minimizing, with respect to a, the
function

$$\frac{1}{g^2}\sum_{n=p}^{N-1}\bigl(s(n)-\mathbf{a}^T\mathbf{s}(n-1,n-p)\bigr)^2 \tag{4-11}$$

Setting the partial derivatives of equation (4-11)
with respect to each of the coefficients a_i to zero
results in a set of linear equations given by

$$\sum_{n=p}^{N-1}\bigl(s(n)-\mathbf{a}^T\mathbf{s}(n-1,n-p)\bigr)\,s(n-i) = 0, \quad i = 1,\ldots,p \tag{4-12}$$

Equation (4-12) corresponds exactly to the equations
obtained by the covariance method of the linear prediction
analysis [7,24].
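As an illustration of how the linear equations (4-12) are set up and solved, the following sketch builds them for synthetic data generated from the model of equation (4-1). It is a minimal example under stated assumptions, not the implementation used in the thesis: the function names, the test coefficients and the small Gaussian-elimination helper are all hypothetical.

```python
import random

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def covariance_method(s, p):
    """Solve (4-12): sum over n = p..N-1 of (s(n) - a^T s(n-1,n-p)) s(n-i) = 0."""
    N = len(s)
    # phi[i-1][k-1] = sum_n s(n-i) s(n-k), for i, k = 1..p
    phi = [[sum(s[n - i] * s[n - k] for n in range(p, N)) for k in range(1, p + 1)]
           for i in range(1, p + 1)]
    rhs = [sum(s[n] * s[n - i] for n in range(p, N)) for i in range(1, p + 1)]
    return solve_linear(phi, rhs)

def synthesize(a, g, N, rng):
    """Generate s(n) = a^T s(n-1,n-p) + g w(n) with zero initial conditions."""
    p = len(a)
    past = [0.0] * p
    out = []
    for _ in range(N):
        sn = sum(ai * si for ai, si in zip(a, past)) + g * rng.gauss(0.0, 1.0)
        out.append(sn)
        past = [sn] + past[:-1]
    return out

rng = random.Random(1)
true_a = [1.2, -0.5]                 # hypothetical all pole coefficients
s = synthesize(true_a, 1.0, 400, rng)
a_hat = covariance_method(s, 2)
```

The matrix of sums is symmetric but not Toeplitz, which is why a general linear solver is used in this sketch.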


IV.3.2 Case 2

In this case, we assume that the initial condition
vector s_I is known and no a priori knowledge of a and g
is available. Then p(a,g|s_0) is maximized with respect
to a and g. From Bayes' rule,

$$p(\mathbf{a},g|\mathbf{s}_0) = \frac{p(\mathbf{s}_0|\mathbf{a},g)\,p(\mathbf{a},g)}{p(\mathbf{s}_0)} \tag{4-13}$$

and since s_I is assumed to be known, p(s_0|a,g) represents
p(s_0|a,g,s_I) evaluated at s_I equal to its assumed known
value. Assuming p(a,g) is constant, maximizing p(a,g|s_0)
is equivalent to maximizing p(s_0|a,g), corresponding again
to the ML estimation of a and g. From equation (4-6) with
known s_I, maximization of p(s_0|a,g) with respect to g leads
to equation (4-7) for g^2. Maximization with respect to a
is identical to minimizing ε_p given by equation (4-8).
However, the minimization is now carried out with respect
to a alone. Comparing equations (4-10) and (4-11), we
see that the function to be minimized with respect to a
is similar in both cases, differing only in the lower
limit of the summation. The linear set of equations for
a is now given by

$$\sum_{n=0}^{N-1}\bigl(s(n)-\mathbf{a}^T\mathbf{s}(n-1,n-p)\bigr)\,s(n-i) = 0, \quad i = 1,2,\ldots,p \tag{4-14}$$

If the initial conditions are indeed known, then we
in fact have available N+p observations of s(n). From
the N+p observations, we use the first p observations to
form the initial condition vector s_I and the remaining N
observations to form the observation vector s_0. If we
consider the relationship between case 1 and case 2 on
the basis of the same total number of observations, then
in fact they lead to identical functions to be minimized
and consequently identical estimates.

In the above case, we have assumed that p(a,g) is
constant and s_I is exactly known. Therefore, maximization
of p(a,g|s_0) was identical to maximizing p(s_0|a,g).
Because maximization of p(s_0|a,g) with respect to a and
g in this case corresponds to the ML estimation for a and
g given (conditioned on) the initial condition vector
s_I = s(-1,-p), it is sometimes referred to as the
Conditional Maximum Likelihood (CML) estimate of a.

As an alternative to using the first p observations
in each analysis frame to form the initial condition vector,
we can assume that the response was zero prior to the
observation interval. In this case, assuming that we have
a total of N actual observations, we augment these with
p additional zero values. Now, if we further extend the
data by p points and augment s(N+p-1,N) with zeroes, then
maximization of p(a,g|s(N+p-1,0)) with respect to a and
g leads to

$$\sum_{n=0}^{N+p-1}\bigl(s(n)-\mathbf{a}^T\mathbf{s}(n-1,n-p)\bigr)\,s(n-i) = 0 \quad \text{for } i = 1,2,\ldots,p \tag{4-15}$$


and s(N+p-1,N) and s(-1,-p) are all 0. These are exactly
the equations given by the correlation method of the
linear prediction analysis. In the context of the linear
prediction analysis, the principal advantage of the
correlation method over the covariance method has been that
the solution of the set of equations involves
the inversion of a Toeplitz matrix, for which there are
particularly efficient methods [30]. In addition, the
resulting all-pole model is guaranteed to be stable. From
equations (4-12) and (4-15), the resulting linear equations
to be solved in both methods are given by

$$\sum_{n}\bigl(s(n)-\mathbf{a}^T\mathbf{s}(n-1,n-p)\bigr)\,s(n-i) = 0, \quad i = 1,2,\ldots,p \tag{4-16}$$

and the summation extends from p to N-1 for the covariance
method and from 0 to N+p-1 for the correlation method.

IV.3.3 Case 3

Now we consider the case when g is known so that
p(a,s_I|s_0) is maximized with respect to a and s_I, and
no a priori information of s_I is available so that
p(s_I) is constant. Assuming p(a,s_I) = p(a)·p(s_I), from
Bayes' rule

$$p(\mathbf{a},\mathbf{s}_I|\mathbf{s}_0) = \frac{p(\mathbf{s}_0|\mathbf{a},\mathbf{s}_I)\,p(\mathbf{a})\,p(\mathbf{s}_I)}{p(\mathbf{s}_0)} \tag{4-17}$$

where p(s_0|a,s_I) represents p(s_0|a,g,s_I) evaluated at g
equal to its assumed known value. Since p(s_I) is assumed
constant, maximizing p(a,s_I|s_0) is equivalent to maximizing
p(s_0|a,s_I)·p(a). Assuming that a has a Gaussian density
with mean ā and covariance function P_0, p(a) is of the
form

$$p(\mathbf{a}) = \frac{1}{(2\pi)^{p/2}\,|P_0|^{1/2}} \exp\Bigl[-\frac{1}{2}(\mathbf{a}-\bar{\mathbf{a}})^T P_0^{-1} (\mathbf{a}-\bar{\mathbf{a}})\Bigr] \tag{4-18}$$

Combining equations (4-6), (4-17) and (4-18), it can be
seen that maximizing equation (4-17) is equivalent to
minimizing ε_p given by

$$\epsilon_p = \frac{1}{g^2}\sum_{n=0}^{N-1}\bigl(s(n)-\mathbf{a}^T\mathbf{s}(n-1,n-p)\bigr)^2 + (\mathbf{a}-\bar{\mathbf{a}})^T P_0^{-1} (\mathbf{a}-\bar{\mathbf{a}}) \tag{4-19}$$

ε_p in equation (4-19) is similar to ε_p in equation (4-8)
or (4-10) but with the additional term (a-ā)^T·P_0^{-1}·(a-ā).
Since this extra term is not a function of s_I, minimization
of ε_p in equation (4-19) with respect to s_I requires that
s_I be such that

$$\sum_{n=0}^{p-1}\bigl(s(n)-\mathbf{a}^T\mathbf{s}(n-1,n-p)\bigr)^2 = 0$$

Therefore minimization of ε_p in equation (4-19) with
respect to a reduces to minimization of ε_p given by

$$\epsilon_p = \frac{1}{g^2}\sum_{n=p}^{N-1}\bigl(s(n)-\mathbf{a}^T\mathbf{s}(n-1,n-p)\bigr)^2 + (\mathbf{a}-\bar{\mathbf{a}})^T P_0^{-1} (\mathbf{a}-\bar{\mathbf{a}}) \tag{4-20}$$

Partial differentiation with respect to a_i for i = 1,2,...,p
results in a set of linear equations.

If no a priori information on a is assumed so that
P_0 = σ_0^2·I with σ_0^2 arbitrarily large, the a obtained in
this case would be identical to a in case 1.
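The linear equations for this case can be written out explicitly. As a sketch, with s_n used as shorthand for s(n-1,n-p) (a notation introduced here only for brevity), setting the gradient of equation (4-20) with respect to a to zero gives:

```latex
\frac{\partial \epsilon_p}{\partial \mathbf{a}}
  = -\frac{2}{g^2}\sum_{n=p}^{N-1}\bigl(s(n)-\mathbf{a}^T\mathbf{s}_n\bigr)\,\mathbf{s}_n
    + 2P_0^{-1}(\mathbf{a}-\bar{\mathbf{a}}) = \mathbf{0}
\quad\Longrightarrow\quad
\Bigl[\frac{1}{g^2}\sum_{n=p}^{N-1}\mathbf{s}_n\mathbf{s}_n^T + P_0^{-1}\Bigr]\mathbf{a}
  = \frac{1}{g^2}\sum_{n=p}^{N-1} s(n)\,\mathbf{s}_n + P_0^{-1}\bar{\mathbf{a}}
```

As P_0^{-1} approaches zero, the a priori terms drop out and the g^2 factors cancel, leaving the covariance method equations (4-12), which is consistent with the limiting statement above.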


IV.3.4 Case 4

Now we maximize p(a|s_0) with respect to a assuming
that g and s_I are known. From Bayes' rule,

$$p(\mathbf{a}|\mathbf{s}_0) = \frac{p(\mathbf{s}_0|\mathbf{a})\,p(\mathbf{a})}{p(\mathbf{s}_0)} \tag{4-21}$$

and since g and s_I are assumed known, p(s_0|a) represents
p(s_0|a,g,s_I) evaluated at g and s_I equal to their
respective assumed known values. Therefore maximizing
p(a|s_0) is equivalent to maximizing p(s_0|a)·p(a).
Assuming p(a) is of the form given by equation (4-18),
maximizing p(a|s_0) in equation (4-21) is the same as
minimizing the same ε_p in equation (4-19), which can
be easily seen by comparing equations (4-17) and (4-21).
Here, however, we minimize ε_p with respect to a alone,
which again corresponds to solving a set of linear equations.
The difference between equations (4-19) and (4-20) is
in the limit of the summation, analogous to the difference
between equations (4-10) and (4-11). If we assume no a
priori information of a, then the second term in equation
(4-20) would be eliminated and the estimate for a obtained
in this case would be identical to that obtained in case 2.

If we assume that s_I = 0 and further extend the
data by p points with 0 (i.e., s(N+p-1,N) = 0) as we did
in case 2, then the function to be minimized is given by

$$\epsilon_p = \frac{1}{g^2}\sum_{n=0}^{N+p-1}\bigl(s(n)-\mathbf{a}^T\mathbf{s}(n-1,n-p)\bigr)^2 + (\mathbf{a}-\bar{\mathbf{a}})^T P_0^{-1} (\mathbf{a}-\bar{\mathbf{a}}) \tag{4-22}$$

with s_I and s(N+p-1,N) both equal to 0. In the limiting
case, as P_0 approaches ∞·I, corresponding to no a priori
information of a, the minimization of ε_p in equation (4-22)
reduces to equation (4-15), which corresponds to the
correlation method of the linear prediction analysis.

In the above discussion, we saw that maximizing
p(a|s_0) leads to a set of linear equations only when g
and s_I are known. In practice these parameters may not
be known exactly. However, we might expect to make some
reasonable guess of g and s_I. Alternatively, we can solve
the linear equations in case 1, assume that these
estimates of g and s_I are exact and maximize equation
(4-21) with respect to a. A third possibility for obtaining
s_I is to use the first p data points as s_I and use
the remaining N-p points as s_0, which leads to the same
estimate of a as in case 3.

In this section, we have seen that maximizing p(a|s_0)
in general is a non-linear problem. However, the problem
can be linearized if we make some specific assumptions about
the a priori density of the parameters and/or include as
parameters for estimation some auxiliary parameters such
as g and s_I. As will be discussed in Chapter V, the notion
of including as parameters for estimation some auxiliary
parameters and making some specific assumptions about the a
priori information on the parameters will again lead to two
linear implementations when we deal with the statistical
parameter estimation from noisy speech. In Section
IV.4, we investigate an alternative way to solve the same
parameter estimation problem discussed in Section IV.3.

IV.4 State Space Approach: All Pole Coefficients as State Vectors

In Section IV.3, g and s_I were assumed to be known
and estimating a by maximizing p(a|s_0) led to solving a
set of linear equations. By representing the model of
speech in a state space form, the same solution can be
obtained in a recursive manner by a Kalman filter. In
Section IV.4.1, the properties of a Kalman filter
relevant to our discussions in this thesis are briefly
summarized. In Section IV.4.2, based on the properties
of a Kalman filter discussed in Section IV.4.1, it is
shown that a Kalman filter applied to the proper
model of speech maximizes p(a|s_0).

IV.4.1 Kalman Filter: Review

Suppose a system can be represented by a state
equation of the following form:

$$\mathbf{x}(n) = F(n)\,\mathbf{x}(n-1) + G(n)\,\mathbf{u}(n)$$
$$\mathbf{z}(n) = H(n)\,\mathbf{x}(n) + \mathbf{v}(n) \quad \text{for } 0 \le n \le N-1 \tag{4-23}$$

where x(n) is a state vector,
z(n) is an observation vector,
u(n) is a vector of zero mean white Gaussian noise
with a given covariance function,
v(n) is a vector of zero mean white Gaussian noise
with a given covariance function uncorrelated with
u(n),
and x(-1) is the initial condition vector which is
Gaussian with a given mean and covariance.

If F(n), G(n) and H(n) are known, then E[x(n)|z(n_1,0)],
which is the optimum under the MMSE criterion, can be obtained
by a linear solution known as the "Kalman filter".
Depending on whether n is greater than, equal to, or
less than n_1, the solution is known as a predictor,
filter or smoother, respectively. For a Gaussian x(n),
which is the case in equation (4-23), the MMSE estimator
is equivalent to the MAP estimator since p(x(n)|z(n_1,0))
is symmetric about the conditional expectation E[x(n)|z(n_1,0)].
The detailed linear solutions of a Kalman filter and its
properties can be found in [22,31,32,33,34].


$$\mathbf{a}(n) = \mathbf{a}(n-1)$$
$$s(n) = \mathbf{s}^T(n-1,n-p)\,\mathbf{a}(n) + g\,w(n) \quad \text{for } 0 \le n \le N-1 \tag{4-24}$$

Equation (4-24) is a special case of equation (4-23). If
g  and  s  are  assumed  known,  then  F (n) ,  G(n)  and  H(n)  are 
completely  specified.  Therefore,  E[aj£(N-l,0) ]  which 
corresponds  to  both  the  MMSE  and  MAP  estimates  of  a  can 
be  obtained  by  a  Kalman  filter.  The  filtering  form  (31, 

32]  of  a  Kalman  filter  applied  to  equation (4-24)  is  given 
by  an  iterative  solution; 

d(n+l)  =  a (n)  +  k  (n+1) •  (s  (n+1)  -  s?  (n ,n+l-p) *a (n) )  (4-25) 

where  £(n)  represents  E[a(n) | s (n,0) ]  and  k(n+l)  is  the 
Kalman  filter  gain  which  is  a  function  of  the  covariance 
matrix  of  a(n) .  The  covariance  matrix  of  a(n)  can  also 
be  updated  and  the  initial  starting  values  a(-l)  and  the 
covariance  of  a(-l)  are,  of  course,  the  a  priori  mean  and 
covariance  of  a.  For  each  n,  a(n)  obtained  in  this 
manner  is  identical  to  a  estimated  by  minimizing  the 
function 

"V  *  l  (s(m)  -  aT*s (m-l,m-p) ) 2  +  (a-a) T*pI1  (a-a) 
g  m=0  ~  u 

(4-26) 

In particular, $\hat{\mathbf{a}}(N-1)$ is the estimate of $\mathbf{a}$
obtained by minimizing equation (4-19) with respect to $\mathbf{a}$. The
filtering form of the Kalman filter solution discussed above is also
known as a recursive least squares procedure, and the primary
advantage of a recursive solution is that the data can
be sequentially processed as they appear.
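As an illustration only, the recursion of equation (4-25) can be sketched in Python as follows. The gain and covariance updates are the standard Kalman (recursive least squares) forms for a constant state; they are an assumption here, since the text refers to [31,32] for them rather than writing them out. The function names and the test signal are hypothetical.

```python
def rls_update(a_hat, P, h, s_new, g2):
    """One step in the spirit of equation (4-25).

    a_hat : current coefficient estimate (list of p floats)
    P     : current p-by-p covariance of the estimate (symmetric list of lists)
    h     : past samples [s(n), s(n-1), ..., s(n+1-p)]
    s_new : newly observed sample s(n+1)
    g2    : excitation variance g**2 (assumed known)
    """
    p = len(a_hat)
    Ph = [sum(P[i][j] * h[j] for j in range(p)) for i in range(p)]
    denom = g2 + sum(h[i] * Ph[i] for i in range(p))
    gain = [v / denom for v in Ph]                        # k(n+1)
    error = s_new - sum(h[i] * a_hat[i] for i in range(p))
    a_next = [a_hat[i] + gain[i] * error for i in range(p)]
    # P is symmetric, so h^T P equals (P h)^T.
    P_next = [[P[i][j] - gain[i] * Ph[j] for j in range(p)] for i in range(p)]
    return a_next, P_next


def recursive_least_squares(samples, initial, g2, prior_mean, prior_cov):
    """Process samples s(0), s(1), ... one at a time.

    initial holds the assumed known initial conditions [s(-1), ..., s(-p)].
    """
    a_hat = list(prior_mean)
    P = [row[:] for row in prior_cov]
    h = list(initial)
    for s_new in samples:
        a_hat, P = rls_update(a_hat, P, h, s_new, g2)
        h = [s_new] + h[:-1]
    return a_hat


# Hypothetical check: a noise-free second-order recursion with a vague prior.
true_a = [1.2, -0.5]
past = [1.0, 0.0]
data = []
for _ in range(40):
    nxt = true_a[0] * past[0] + true_a[1] * past[1]
    data.append(nxt)
    past = [nxt, past[0]]
estimate = recursive_least_squares(
    data, [1.0, 0.0], 1.0, [0.0, 0.0], [[1e6, 0.0], [0.0, 1e6]])
```

With a very broad prior the estimate should approach the coefficients that generated the data; with an informative prior it is pulled toward the prior mean, as the second term of equation (4-26) suggests.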


CHAPTER V STATISTICAL PARAMETER ESTIMATION FROM NOISY SPEECH

V.1 Introduction

In Chapter IV, a framework was established for the
MAP parameter estimation of the noise-free speech. Two
of its forms, leading to equations (4-12) and (4-15),
have been used extensively and with considerable success
in linear prediction speech analysis, and are currently
the basis for many speech processing systems
[7,8,12,14,24,25,29]. It is well known, however, that
these procedures degrade quickly in the presence of
additive background noise [2,3]. Consequently, it is of
interest to consider whether the same basic approach and
philosophy can be applied when the observations are
recognized to be corrupted by the background noise. Thus,
in this chapter, we consider the statistical parameter
estimation from the noisy speech based on the MAP
estimation procedure.

In our discussions in this chapter, we first consider
the case of the white Gaussian background noise and then
extend the theoretical results obtained to a more general
case when the background noise is colored. In Section
V.2, the MAP estimation procedure that maximizes the
probability density function of the parameters to be
estimated conditioned on the noisy speech vector will
be shown to be a non-linear problem. In Section V.3, we
develop a linear iterative algorithm which approximates
the MAP estimation procedure. In Section V.4, we develop
another linear iterative algorithm by revising the method
discussed in Section V.3. In Section V.5, we extend the
theoretical results discussed in Sections V.2, V.3, and
V.4 to a more general case when the background noise is
colored. In Section V.6, we relate the two linear iterative
algorithms to the MAP estimation procedure.

V.2  MAP  Estimation  Procedure:  A  Non-linear  Problem 

Speech is again assumed to be generated by the model
of equation (4-1), and the elements of the coefficient vector
$\mathbf{a}$ are the basic parameters to be estimated. The
observation vector $\mathbf{y}(N-1,0)$, which will alternatively
be denoted as $\mathbf{y}_0$, consists of the sum of the speech
and background noise, i.e.,

$$ \mathbf{y}(N-1,0) = \mathbf{s}(N-1,0) + \mathbf{d}(N-1,0) \qquad (5-1) $$

where $d(n)$ is zero mean white Gaussian background noise
with variance $\sigma_d^2$ and is assumed to be uncorrelated
with $s(n)$.

Following a procedure similar to that of case 4
(Section IV.3.4), we can consider choosing the parameters
$\mathbf{a}$ to maximize $p(\mathbf{a} \mid \mathbf{y}_0)$. In Chapter IV,
when we assumed that $g$ and $\mathbf{s}_I$ were known and $p(\mathbf{a})$
was Gaussian, the resulting equations were linear. For the current
situation, this will no longer be the case. Specifically, from equations
(4-1) and (5-1),

$$ y(n) = \mathbf{a}^T\mathbf{s}(n-1,n-p) + g\,w(n) + d(n) \qquad (5-2) $$

or

$$ y(n) = \mathbf{a}^T\mathbf{y}(n-1,n-p) + g\,w(n) + d(n) - \mathbf{a}^T\mathbf{d}(n-1,n-p) \qquad (5-3) $$


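Expressing $p(\mathbf{y}_0 \mid \mathbf{a},g,\mathbf{s}_I)$ in a manner
similar to equation (4-4),

$$ p(\mathbf{y}_0 \mid \mathbf{a},g,\mathbf{s}_I) = \prod_{n=p}^{N-1} p(y(n) \mid \mathbf{a},g,\mathbf{s}_I,\mathbf{y}(n-1,0)) \cdot \prod_{n=1}^{p-1} p(y(n) \mid \mathbf{a},g,\mathbf{s}_I,\mathbf{y}(n-1,0)) \cdot p(y(0) \mid \mathbf{a},g,\mathbf{s}_I) \qquad (5-4) $$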


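From equation (5-2), for $n \ge p$,
$p(y(n) \mid \mathbf{a},g,\mathbf{s}_I,\mathbf{y}(n-1,0))$ is Gaussian with
mean $\mathbf{a}^T E[\mathbf{s}(n-1,n-p) \mid \mathbf{a},g,\mathbf{s}_I,\mathbf{y}(n-1,0)]$
and variance
$g^2 + \sigma_d^2 + \mathbf{a}^T \mathrm{Var}[\mathbf{s}(n-1,n-p) \mid \mathbf{a},g,\mathbf{s}_I,\mathbf{y}(n-1,0)]\,\mathbf{a}$,
where $E[\mathbf{s}(n-1,n-p) \mid \mathbf{a},g,\mathbf{s}_I,\mathbf{y}(n-1,0)]$ and
$\mathrm{Var}[\mathbf{s}(n-1,n-p) \mid \mathbf{a},g,\mathbf{s}_I,\mathbf{y}(n-1,0)]$
denote the mean and covariance of $\mathbf{s}(n-1,n-p)$ conditioned on
$\mathbf{a}$, $g$, $\mathbf{s}_I$ and $\mathbf{y}(n-1,0)$. Since the
variance is a function of $\mathbf{a}$, and will likewise be so for
the remaining terms, the resulting equations for maximizing
$p(\mathbf{a} \mid \mathbf{y}_0)$ will by necessity be non-linear.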

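Even though we have only shown that maximizing
$p(\mathbf{a} \mid \mathbf{y}_0)$, which corresponds to case 4 in Chapter IV,
is a non-linear problem, it is easy to see that maximizing
$p(\mathbf{a},g,\mathbf{s}_I \mid \mathbf{y}_0)$, $p(\mathbf{a},g \mid \mathbf{y}_0)$
or $p(\mathbf{a},\mathbf{s}_I \mid \mathbf{y}_0)$, corresponding to
cases 1, 2 and 3 in the previous chapter, is also a non-linear
problem. This is partly because each of the three
density functions $p(\mathbf{a},g,\mathbf{s}_I \mid \mathbf{y}_0)$,
$p(\mathbf{a},g \mid \mathbf{y}_0)$, or $p(\mathbf{a},\mathbf{s}_I \mid \mathbf{y}_0)$
is a product of several terms, one of which is

$$ \prod_{n=p}^{N-1} p(y(n) \mid \mathbf{a},g,\mathbf{s}_I,\mathbf{y}(n-1,0)). $$

It was shown above that $p(y(n) \mid \mathbf{a},g,\mathbf{s}_I,\mathbf{y}(n-1,0))$
for $p \le n \le N-1$ has a variance which is a function of $\mathbf{a}$.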

V.3 Maximization of $p(\mathbf{a},\mathbf{s}_0 \mid \mathbf{y}_0)$: Linearized MAP (LMAP)
Estimation Procedure

To maximize $p(\mathbf{a} \mid \mathbf{y}_0)$, which was shown to be a
non-linear problem in Section V.2, one approach is to determine
$p(\mathbf{a} \mid \mathbf{y}_0)$ for any specific $\mathbf{a}$ and then use
some form of hill searching algorithm [35,36,37]. In general,
solving such a non-linear problem is computationally
undesirable. Thus, we are led to consider another method
which has a linear implementation, but which may not be
optimum in the sense that $p(\mathbf{a} \mid \mathbf{y}_0)$ is not maximized.
In Chapter IV, we have seen that maximizing $p(\mathbf{a} \mid \mathbf{s}_0)$
is in general a non-linear problem. However, by incorporating
some auxiliary parameters as parameters for estimation
and/or making some specific assumptions on the a priori
knowledge of the parameters, the resulting equations can
be made linear. When the resulting equations (4-12, 4-15,
4-20, 4-22) are used to estimate $\mathbf{a}$ and speech is
synthesized based on the estimated $\mathbf{a}$, experience [7,8,24,25]
has shown that intelligible speech with high quality can
be generated. Motivated by the apparent success in the
case of noise-free speech, we take a similar approach in
the case of noisy speech. More specifically, we assume
that $g$ and $\mathbf{s}_I$ are known, and include the speech vector
$\mathbf{s}_0$ as an additional parameter to be estimated. Thus we
maximize $p(\mathbf{a},\mathbf{s}_0 \mid \mathbf{y}_0)$ jointly with respect
to $\mathbf{a}$ and $\mathbf{s}_0$.$^6$ In this section, we show that
maximizing $p(\mathbf{a},\mathbf{s}_0 \mid \mathbf{y}_0)$ is
still a non-linear problem but can be implemented by a
linear iterative procedure.

$^6$ A linear implementation for $\mathbf{a}$ can also be obtained
essentially in a parallel manner by maximizing
$p(\mathbf{a},\mathbf{s}_0,g,\mathbf{s}_I \mid \mathbf{y}_0)$,
$p(\mathbf{a},\mathbf{s}_0,g \mid \mathbf{y}_0)$ or
$p(\mathbf{a},\mathbf{s}_0,\mathbf{s}_I \mid \mathbf{y}_0)$ with the
appropriate a priori density assumptions of the unknown
parameters. This situation is analogous to the four cases
considered in Chapter IV and allows us to estimate the
other parameters ($g$, $\mathbf{s}_I$) in the same manner as $\mathbf{a}$
if such an approach is desired. In the discussions in this chapter,
we concentrate primarily on maximizing $p(\mathbf{a},\mathbf{s}_0 \mid \mathbf{y}_0)$.

V.3.1 An Algorithm to Maximize $p(\mathbf{a},\mathbf{s}_0 \mid \mathbf{y}_0)$

Suppose we begin with an assumed set of initial
values $\hat{\mathbf{a}}_0$ for the coefficient vector $\mathbf{a}$ and,
based on this, estimate $\mathbf{s}_0$ by maximizing
$p(\mathbf{s}_0 \mid \hat{\mathbf{a}}_0,\mathbf{y}_0)$. Denoting this
first estimate of $\mathbf{s}_0$ by $\hat{\mathbf{s}}_{01}$, we then form a
first estimate $\hat{\mathbf{a}}_1$ of $\mathbf{a}$. This procedure can then
be continued iteratively to obtain the final estimate
$\hat{\mathbf{a}}_\infty$ of the coefficients. We
now show that this procedure for estimating $\mathbf{a}$ (and $\mathbf{s}_0$)
always increases $p(\mathbf{a},\mathbf{s}_0 \mid \mathbf{y}_0)$ at each
iteration unless a converging solution is obtained. Specifically, since
$\hat{\mathbf{a}}_i$ is obtained by maximizing
$p(\mathbf{a} \mid \hat{\mathbf{s}}_{0i},\mathbf{y}_0)$,


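$$ p(\hat{\mathbf{a}}_i \mid \hat{\mathbf{s}}_{0i},\mathbf{y}_0)\cdot p(\hat{\mathbf{s}}_{0i} \mid \mathbf{y}_0) \ge p(\hat{\mathbf{a}}_{i-1} \mid \hat{\mathbf{s}}_{0i},\mathbf{y}_0)\cdot p(\hat{\mathbf{s}}_{0i} \mid \mathbf{y}_0) \qquad (5-5a) $$

and therefore

$$ p(\hat{\mathbf{a}}_i,\hat{\mathbf{s}}_{0i} \mid \mathbf{y}_0) \ge p(\hat{\mathbf{a}}_{i-1},\hat{\mathbf{s}}_{0i} \mid \mathbf{y}_0) \qquad (5-5b) $$

The equality sign in equation (5-5b) holds only if
$\hat{\mathbf{a}}_i = \hat{\mathbf{a}}_{i-1}$ since
$p(\mathbf{a} \mid \mathbf{s}_0,\mathbf{y}_0)$ is Gaussian in $\mathbf{a}$.
Since $\hat{\mathbf{s}}_{0i}$ is obtained by maximizing
$p(\mathbf{s}_0 \mid \hat{\mathbf{a}}_{i-1},\mathbf{y}_0)$,

$$ p(\hat{\mathbf{s}}_{0i} \mid \hat{\mathbf{a}}_{i-1},\mathbf{y}_0)\cdot p(\hat{\mathbf{a}}_{i-1} \mid \mathbf{y}_0) \ge p(\hat{\mathbf{s}}_{0,i-1} \mid \hat{\mathbf{a}}_{i-1},\mathbf{y}_0)\cdot p(\hat{\mathbf{a}}_{i-1} \mid \mathbf{y}_0) \qquad (5-6a) $$

and therefore

$$ p(\hat{\mathbf{a}}_{i-1},\hat{\mathbf{s}}_{0i} \mid \mathbf{y}_0) \ge p(\hat{\mathbf{a}}_{i-1},\hat{\mathbf{s}}_{0,i-1} \mid \mathbf{y}_0) \qquad (5-6b) $$

The equality sign in equation (5-6b) holds only if
$\hat{\mathbf{s}}_{0i} = \hat{\mathbf{s}}_{0,i-1}$ since
$p(\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0)$ is Gaussian in $\mathbf{s}_0$. From
equations (5-5b) and (5-6b),

$$ p(\hat{\mathbf{a}}_i,\hat{\mathbf{s}}_{0i} \mid \mathbf{y}_0) \ge p(\hat{\mathbf{a}}_{i-1},\hat{\mathbf{s}}_{0,i-1} \mid \mathbf{y}_0) \qquad (5-7) $$

in which the equality sign holds if
$\hat{\mathbf{a}}_i = \hat{\mathbf{a}}_{i-1}$ and
$\hat{\mathbf{s}}_{0i} = \hat{\mathbf{s}}_{0,i-1}$.
Equation (5-7) shows that the iterative
procedure discussed above always increases
$p(\mathbf{a},\mathbf{s}_0 \mid \mathbf{y}_0)$ at
each iteration unless a converging solution is reached.

If the initial guess for $\mathbf{a}$ and the shape of
$p(\mathbf{a},\mathbf{s}_0 \mid \mathbf{y}_0)$ are
such that this procedure converges to the global maximum,
then this procedure will in fact correspond to the
joint MAP estimate of the parameters $\mathbf{a}$ and $\mathbf{s}_0$. Thus,
in essence, this attempt to simplify the problem computationally
corresponds to augmenting the desired set of
parameters $\mathbf{a}$ with the additional parameters $\mathbf{s}_0$.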

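V.3.2 Maximization of $p(\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0)$

From the discussions in Chapter IV, maximizing
$p(\mathbf{a} \mid \mathbf{s}_0,\mathbf{y}_0)$, which is equivalent to
maximizing $p(\mathbf{a} \mid \mathbf{s}_0)$,
requires the solution of a set of $p$ linear equations for
$\mathbf{a}$. To show that the algorithm requires solving only
linear equations, we now show that maximizing
$p(\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0)$ is also a linear problem.

From Bayes' rule, $p(\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0)$ can be
denoted as

$$ p(\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0) = \frac{p(\mathbf{y}_0 \mid \mathbf{a},\mathbf{s}_0)\cdot p(\mathbf{s}_0 \mid \mathbf{a})}{p(\mathbf{y}_0 \mid \mathbf{a})} \qquad (5-8) $$

Denoting $p(\mathbf{y}_0 \mid \mathbf{a},\mathbf{s}_0)$ by

$$ p(\mathbf{y}_0 \mid \mathbf{a},\mathbf{s}_0) = \prod_{n=1}^{N-1} p(y(n) \mid \mathbf{a},\mathbf{s}_0,\mathbf{y}(n-1,0)) \cdot p(y(0) \mid \mathbf{a},\mathbf{s}_0) \qquad (5-9) $$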

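and noting that $p(y(n) \mid \mathbf{a},\mathbf{s}_0,\mathbf{y}(n-1,0))$ is
Gaussian with mean $s(n)$ and variance $\sigma_d^2$ for
$1 \le n \le N-1$ and $p(y(0) \mid \mathbf{a},\mathbf{s}_0)$ is Gaussian with
mean $s(0)$ and variance $\sigma_d^2$,
$p(\mathbf{y}_0 \mid \mathbf{a},\mathbf{s}_0)$ can be denoted as

$$ p(\mathbf{y}_0 \mid \mathbf{a},\mathbf{s}_0) = \frac{1}{(2\pi\sigma_d^2)^{N/2}} \cdot \exp\Bigl(-\frac{1}{2\sigma_d^2}\sum_{n=0}^{N-1}(y(n)-s(n))^2\Bigr) \qquad (5-10) $$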

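Combining equations (4-6) and (5-10) with equation (5-8) with
the assumption that $g$ and $\mathbf{s}_I$ are given and noting that
$p(\mathbf{y}_0 \mid \mathbf{a})$ is not a function of $\mathbf{s}_0$,

$$ p(\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0) = \text{constant} \cdot \frac{1}{(4\pi^2 g^2 \sigma_d^2)^{N/2}} \cdot \exp\bigl(-\tfrac{1}{2}\,\varepsilon_p\bigr) \qquad (5-11a) $$

and

$$ \varepsilon_p = \frac{1}{g^2}\sum_{n=0}^{N-1}\bigl(s(n) - \mathbf{a}^T\mathbf{s}(n-1,n-p)\bigr)^2 + \frac{1}{\sigma_d^2}\sum_{n=0}^{N-1}(y(n)-s(n))^2 \qquad (5-11b) $$

Maximizing $p(\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0)$ is equivalent to
minimizing $\varepsilon_p$ in equation (5-11b), and thus we choose the
$\hat{\mathbf{s}}_0$ that satisfies the set of linear equations

$$ \frac{\partial \varepsilon_p}{\partial s(i)} = 0 \quad \text{for } i = 0,1,2,\ldots,N-1 \qquad (5-12) $$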


A closed form expression for the solution of equation
(5-12) can be obtained by representing the speech model
with equation (3-6). From equation (5-1),

$$ p(\mathbf{y}(N-1,0) \mid \mathbf{a},\mathbf{s}(N-1,0)) = p(\mathbf{y}(N-1,0) \mid \mathbf{s}(N-1,0)) = N(\mathbf{s}(N-1,0),\ \sigma_d^2 \cdot I) \qquad (5-13a) $$

From equation (3-6e) with $u(n) = g \cdot w(n)$,

$$ p(\mathbf{s}(N-1,0) \mid \mathbf{a}) = N\bigl((I-A)^{-1} A_I\,\mathbf{s}_I,\ g^2 (I-A)^{-1} ((I-A)^{-1})^T\bigr) \qquad (5-13b) $$


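We now combine equations (5-13) with equation (5-8),
assuming that $g$ and $\mathbf{s}_I$ are given and noting that
$p(\mathbf{y}_0 \mid \mathbf{a})$ is not a function of $\mathbf{s}_0$.
The result is that

$$ p(\mathbf{s}(N-1,0) \mid \mathbf{a},\mathbf{y}_0) = N\Bigl(\bigl(R_s^{-1} + \tfrac{1}{\sigma_d^2} I\bigr)^{-1}\bigl(\tfrac{1}{\sigma_d^2}\mathbf{y}_0 + R_s^{-1}\mathbf{m}_s\bigr),\ \bigl(R_s^{-1} + \tfrac{1}{\sigma_d^2} I\bigr)^{-1}\Bigr) \qquad (5-14a) $$

where

$$ \mathbf{m}_s = (I-A)^{-1} A_I\,\mathbf{s}_I \quad \text{and} \quad R_s = g^2 (I-A)^{-1} ((I-A)^{-1})^T \qquad (5-14b) $$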
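The step from equations (5-13) and (5-8) to equation (5-14) is the usual completion of the square for a product of two Gaussian densities. As a sketch in the notation above, with $\mathbf{s}_0 = \mathbf{s}(N-1,0)$, with all terms that do not depend on $\mathbf{s}_0$ collected into a constant, and with $\mathbf{m}$ and $V$ used only as shorthand for the resulting mean and covariance:

```latex
\begin{aligned}
-2\ln p(\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0)
 &= \frac{1}{\sigma_d^2}(\mathbf{y}_0-\mathbf{s}_0)^T(\mathbf{y}_0-\mathbf{s}_0)
  + (\mathbf{s}_0-\mathbf{m}_s)^T R_s^{-1}(\mathbf{s}_0-\mathbf{m}_s) + \text{const} \\
 &= \mathbf{s}_0^T\Bigl(R_s^{-1}+\tfrac{1}{\sigma_d^2}I\Bigr)\mathbf{s}_0
  - 2\,\mathbf{s}_0^T\Bigl(\tfrac{1}{\sigma_d^2}\mathbf{y}_0 + R_s^{-1}\mathbf{m}_s\Bigr) + \text{const} \\
 &= (\mathbf{s}_0-\mathbf{m})^T V^{-1}(\mathbf{s}_0-\mathbf{m}) + \text{const},
\end{aligned}
\qquad
V = \Bigl(R_s^{-1}+\tfrac{1}{\sigma_d^2}I\Bigr)^{-1},\quad
\mathbf{m} = V\Bigl(\tfrac{1}{\sigma_d^2}\mathbf{y}_0 + R_s^{-1}\mathbf{m}_s\Bigr).
```

Reading off the quadratic form gives the Gaussian mean and covariance stated in equation (5-14a).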


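Therefore, maximizing $p(\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0)$ is
equivalent to estimating $\mathbf{s}_0$ by

$$ \hat{\mathbf{s}}_0 = \bigl(R_s^{-1} + \tfrac{1}{\sigma_d^2} I\bigr)^{-1}\bigl(\tfrac{1}{\sigma_d^2}\mathbf{y}_0 + R_s^{-1}\mathbf{m}_s\bigr) \qquad (5-15) $$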


An alternative way to maximize $p(\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0)$
is from the smoothing form [33,34] of a Kalman filter. As we
discussed in Section III.3, equation (3-13) of the
noisy speech model can be represented in the form of
equation (3-7) with $x(n)$, $F(n)$, $G(n)$, $u(n)$, $z(n)$, $H(n)$
and $v(n)$ given by equation (3-9). As we discussed in
Section IV.4.1, it is well known that for equation (3-7)
with zero mean white Gaussian $u(n)$ and $v(n)$ uncorrelated
with each other (this corresponds to equation (4-23)),
the smoothing form of a Kalman filter leads to
$E[x(n) \mid \mathbf{z}(N-1,0), F(n)]$ for $n = 0,1,2,\ldots,N-1$, which
corresponds to $E[\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0]$. Since
$p(\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0)$ is jointly Gaussian,
$E[\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0]$ is also the MAP estimate of
$\mathbf{s}_0$ that maximizes $p(\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0)$.
The Kalman filtering approach has an advantage in that only
$p \times p$ matrix operations are required (the state $x(n)$ has $p$
elements), while equation (5-15) requires
$N \times N$ matrix operations.




V.3.3 Linearized MAP Estimation Procedure

Summarizing the steps involved in the linear implementation
method, we have

Step 1: Begin with $\hat{\mathbf{a}}_i$, the $i$th estimate of $\mathbf{a}$.

Step 2: Obtain $\hat{\mathbf{s}}_{0,i+1}$, the $(i+1)$st estimate of
$\mathbf{s}_0$, by solving equation (5-12), from equation (5-15),
or from the smoothing form of a Kalman filter.

Step 3: Obtain $\hat{\mathbf{a}}_{i+1}$, the $(i+1)$st estimate of
$\mathbf{a}$, by minimizing equation (4-19) with
$\hat{\mathbf{s}}_{0,i+1}$ obtained in Step 2.

The above steps complete one iteration and the procedure
can be continued for as many iterations as desired.
The initial estimate $\hat{\mathbf{a}}_0$ may be obtained by simply applying
the correlation method of the linear prediction analysis
to $\mathbf{y}_0$. We'll refer to this algorithm as the "Linearized
MAP" (LMAP) estimation procedure.
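The three steps above can be sketched for the simplest case as follows. This is an illustration under stated assumptions rather than the implementation used in this work: it takes $p = 1$, $\mathbf{s}_I = 0$, known $g^2$ and $\sigma_d^2$, and a flat prior on $\mathbf{a}$ (so that Step 3 reduces to an ordinary least squares fit rather than the full equation (4-19)). For $p = 1$ the $N$ equations of (5-12) are tridiagonal and are solved here by the Thomas algorithm. All function names and the synthetic data are hypothetical.

```python
import random


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i]."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def epsilon_p(a, s, y, g2, sd2):
    """Equation (5-11b) for p = 1 with s(-1) = 0."""
    prev = [0.0] + s[:-1]
    prediction = sum((sn - a * sp) ** 2 for sn, sp in zip(s, prev))
    fit = sum((yn - sn) ** 2 for yn, sn in zip(y, s))
    return prediction / g2 + fit / sd2


def estimate_speech(a, y, g2, sd2):
    """Step 2: solve the N linear equations (5-12), tridiagonal when p = 1."""
    n = len(y)
    off = -a / g2
    diag = [(1.0 + a * a) / g2 + 1.0 / sd2] * n
    diag[-1] = 1.0 / g2 + 1.0 / sd2           # the last sample has no successor
    lower = [0.0] + [off] * (n - 1)
    upper = [off] * (n - 1) + [0.0]
    return solve_tridiagonal(lower, diag, upper, [yn / sd2 for yn in y])


def estimate_coefficient(s):
    """Step 3 under a flat prior on a: least squares fit of s(n) on s(n-1)."""
    num = sum(s[n] * s[n - 1] for n in range(1, len(s)))
    den = sum(s[n - 1] ** 2 for n in range(1, len(s)))
    return num / den


def lmap(y, g2, sd2, iterations=5):
    a = estimate_coefficient(y)               # initial estimate from the noisy data
    trace = []
    s = list(y)
    for _ in range(iterations):
        s = estimate_speech(a, y, g2, sd2)    # Step 2
        trace.append(epsilon_p(a, s, y, g2, sd2))
        a = estimate_coefficient(s)           # Step 3
        trace.append(epsilon_p(a, s, y, g2, sd2))
    return a, s, trace


# Hypothetical synthetic data: first-order speech model plus white noise.
random.seed(0)
clean, previous = [], 0.0
for _ in range(200):
    previous = 0.8 * previous + random.gauss(0.0, 1.0)
    clean.append(previous)
noisy = [v + random.gauss(0.0, 1.0) for v in clean]
a_hat, s_hat, trace = lmap(noisy, g2=1.0, sd2=1.0)
```

Under these assumptions $p(\mathbf{a},\mathbf{s}_0 \mid \mathbf{y}_0)$ is proportional to $\exp(-\varepsilon_p/2)$, so the recorded values of $\varepsilon_p$ should never increase from one half-step to the next, which is the behaviour described by equation (5-7).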

In our discussions so far, we have assumed that $g$
and $\mathbf{s}_I$ are known. Even though these parameters are not
known exactly, we might expect to make some reasonable
guess of $g$ and $\mathbf{s}_I$. For example, in the LMAP estimation
procedure, for each iteration when Step 2 is completed,
we have an estimate of $\mathbf{s}_0$. Before going to Step 3, we
could maximize $p(\mathbf{a},g,\mathbf{s}_I \mid \hat{\mathbf{s}}_0)$, which
leads to equations (4-7) and (4-9), from which $g$ and $\mathbf{s}_I$ can
be estimated. Then we can assume that these estimates are exact and
use them in Step 3 of the current iteration and Step 2 of the
next iteration. Another possibility for estimating $g$
and $\mathbf{s}_I$ is to jointly estimate $\mathbf{a}$, $g$ and
$\mathbf{s}_I$ in Step 3 from the $\hat{\mathbf{s}}_0$ estimated in Step 2,
with the assumption of no a priori information of $g$ and a general
Gaussian density assumption of $\mathbf{a}$ and $\mathbf{s}_I$. An example
of $p(\mathbf{s}_I)$ could be $N(\mathbf{y}(-1,-p),\ \sigma_d^2 I)$. In
Section IV.3.1, it was shown that
$p(\mathbf{a},g,\mathbf{s}_I \mid \mathbf{s}_0)$ could be maximized by
solving a set of linear equations if no a priori information of
$\mathbf{a}$, $g$ and $\mathbf{s}_I$ is available. When a priori information
of $\mathbf{a}$ and $\mathbf{s}_I$ is available, jointly maximizing
$p(\mathbf{a},g,\mathbf{s}_I \mid \hat{\mathbf{s}}_0)$ is a
non-linear problem. However, we can again solve iteratively
by maximizing $p(\mathbf{a},\mathbf{s}_I \mid g,\hat{\mathbf{s}}_0)$ with
respect to $\mathbf{a}$ and $\mathbf{s}_I$ and
then maximizing $p(g \mid \mathbf{a},\mathbf{s}_I,\hat{\mathbf{s}}_0)$ with
respect to $g$ for each iteration. Maximizing
$p(\mathbf{a},\mathbf{s}_I \mid g,\hat{\mathbf{s}}_0)$ again involves an
iterative procedure in which
$p(\mathbf{a} \mid \mathbf{s}_I,g,\hat{\mathbf{s}}_0)$ is maximized
with respect to $\mathbf{a}$ and then
$p(\mathbf{s}_I \mid \mathbf{a},g,\hat{\mathbf{s}}_0)$ is maximized with
respect to $\mathbf{s}_I$ for each iteration. It can be shown$^7$ that
the above procedure never decreases
$p(\mathbf{a},g,\mathbf{s}_I \mid \hat{\mathbf{s}}_0)$ at any
iteration. Maximizing $p(\mathbf{a} \mid \mathbf{s}_I,g,\hat{\mathbf{s}}_0)$,
$p(\mathbf{s}_I \mid \mathbf{a},g,\hat{\mathbf{s}}_0)$, or
$p(g \mid \mathbf{a},\mathbf{s}_I,\hat{\mathbf{s}}_0)$ involves solving a
set of linear equations.$^8$

$^7$ This statement can be proved in a manner analogous to that of
equations (5-5), (5-6) and (5-7).

$^8$ The derivations are similar to the derivations in the
four cases (Sections IV.3.1, IV.3.2, IV.3.3, and IV.3.4) and
they begin from equation (4-6).

A third possibility for $\mathbf{s}_I$ and $g$ is simply to assume that
$\mathbf{s}_I = \mathbf{0}$, as we did in case 2 (Section IV.3.2), which
led to the correlation method of the linear prediction analysis,
and estimate $g$ from the energy considerations which will
be discussed further in Chapter VI.

The discussions so far were based on the assumption
that the primary interest is in the estimation of $\mathbf{a}$. It
is important to note, however, that the LMAP estimation
procedure estimates $\mathbf{s}_0$ in the process of estimating
$\mathbf{a}$ by $\hat{\mathbf{s}}_0 = E[\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0]$.
The $\hat{\mathbf{s}}_0$ estimated in this manner can be
directly used as enhanced speech. Therefore the LMAP
algorithm discussed in this section can be used not only
for the bandwidth compression but also for the enhancement
of noisy speech.

V.4 Revised Linearized MAP (RLMAP) Estimation Procedure

V.4.1 Motivation for the Revision

A careful observation of the LMAP estimation procedure
discussed in Section V.3 leads to another estimation
procedure that again requires solving a set of linear
equations in an iterative manner. In Step 2 of the LMAP
estimation procedure, we estimate $\mathbf{s}_0$ by
$E[\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0]$. In
Step 3, we note that the MAP estimate of $\mathbf{a}$ corresponding
to maximizing $p(\mathbf{a} \mid \hat{\mathbf{s}}_0)$ uses the values
$\hat{s}(i)$ to form products of the form $\hat{s}(i) \cdot \hat{s}(j)$.
Thus estimating $\mathbf{s}_0$ in Step 2 by
$E[\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0]$ corresponds to estimating
$s(i) \cdot s(j)$ as

$$ \widehat{s(i) \cdot s(j)} = E[s(i) \mid \mathbf{a},\mathbf{y}_0] \cdot E[s(j) \mid \mathbf{a},\mathbf{y}_0] \qquad (5-16) $$

As an alternative, we can consider generating directly the
MMSE estimate of the product $s(i) \cdot s(j)$. Thus the estimate
of $s(i) \cdot s(j)$ is given by

$$ \widehat{s(i) \cdot s(j)} = E[s(i) \cdot s(j) \mid \mathbf{a},\mathbf{y}_0] \qquad (5-17) $$

In this method, then, we follow the same procedure as we
did in the LMAP method, with the difference that
$s(i) \cdot s(j)$ is estimated by equation (5-17) rather
than equation (5-16).

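V.4.2 Estimation of $s(i) \cdot s(j)$ by $E[s(i) \cdot s(j) \mid \mathbf{a},\mathbf{y}_0]$

In this section, we show that $E[s(i) \cdot s(j) \mid \mathbf{a},\mathbf{y}_0]$
can be obtained by solving sets of linear equations.

From the expression of $p(\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0)$ in
equation (5-11), $\varepsilon_p$ in equation (5-11b) can be written as

$$ \varepsilon_p = \sum_{i=0}^{N-1}\sum_{j=0}^{N-1} \beta_{ij}\,(s(i) - m_i)\cdot(s(j) - m_j) + \text{constant} \qquad (5-18) $$

Since $p(\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0)$ is jointly Gaussian in
$\mathbf{s}_0$, $[\beta_{ij}]^{-1}$ is a covariance matrix for $\mathbf{s}_0$
conditioned on $\mathbf{a}$ and $\mathbf{y}_0$, where
$[\beta_{ij}]^{-1}$ represents the inverse of a matrix whose $ij$th
element is $\beta_{ij}$. Denoting this covariance matrix by $[\gamma_{ij}]$
(i.e., $[\gamma_{ij}] = [\beta_{ij}]^{-1}$),

$$ \gamma_{ij} = E\bigl[(s(i) - E[s(i) \mid \mathbf{a},\mathbf{y}_0])\cdot(s(j) - E[s(j) \mid \mathbf{a},\mathbf{y}_0]) \mid \mathbf{a},\mathbf{y}_0\bigr] = E[s(i) \cdot s(j) \mid \mathbf{a},\mathbf{y}_0] - E[s(i) \mid \mathbf{a},\mathbf{y}_0]\cdot E[s(j) \mid \mathbf{a},\mathbf{y}_0] \qquad (5-19) $$

Therefore,

$$ E[s(i) \cdot s(j) \mid \mathbf{a},\mathbf{y}_0] = \gamma_{ij} + E[s(i) \mid \mathbf{a},\mathbf{y}_0]\cdot E[s(j) \mid \mathbf{a},\mathbf{y}_0] \qquad (5-20) $$

in which $[\gamma_{ij}]$ is given by $[\beta_{ij}]^{-1}$.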

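A closed form expression for $\gamma_{ij}$ and therefore for
$E[s(i) \cdot s(j) \mid \mathbf{a},\mathbf{y}_0]$ can be obtained by
representing the speech model with equation (3-6). From equation (5-14),
$p(\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0)$ is given by

$$ p(\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0) = N(\mathbf{m}, V) \qquad (5-21a) $$

in which

$$ \mathbf{m} = \bigl(R_s^{-1} + \tfrac{1}{\sigma_d^2} I\bigr)^{-1}\bigl(\tfrac{1}{\sigma_d^2}\mathbf{y}_0 + R_s^{-1}\mathbf{m}_s\bigr) \qquad (5-21b) $$

and

$$ V = [\gamma_{ij}] = \bigl(R_s^{-1} + \tfrac{1}{\sigma_d^2} I\bigr)^{-1} \qquad (5-21c) $$

where $\mathbf{m}_s$ and $R_s$ are given by equation (5-14b).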

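Since

$$ V = E\bigl[(\mathbf{s}_0 - E[\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0])\cdot(\mathbf{s}_0 - E[\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0])^T \mid \mathbf{a},\mathbf{y}_0\bigr] = E[\mathbf{s}_0\mathbf{s}_0^T \mid \mathbf{a},\mathbf{y}_0] - E[\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0]\cdot E[\mathbf{s}_0^T \mid \mathbf{a},\mathbf{y}_0], \qquad (5-22) $$

$$ E[\mathbf{s}_0\mathbf{s}_0^T \mid \mathbf{a},\mathbf{y}_0] = V + E[\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0]\cdot E[\mathbf{s}_0^T \mid \mathbf{a},\mathbf{y}_0] = V + \mathbf{m}\,\mathbf{m}^T \qquad (5-23) $$

in which $\mathbf{m}$ and $V$ are given by equation (5-21). Equation
(5-23) is a closed form expression for
$E[s(i) \cdot s(j) \mid \mathbf{a},\mathbf{y}_0]$ for $0 \le i,j \le N-1$.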

An alternative way to obtain $\gamma_{ij}$ in equation (5-20)
is by representing the noisy speech model with equation
(3-13). When $u(n) = g \cdot w(n)$, equation (3-13) is a special
case of equation (4-23). Then from the smoothing form of
a Kalman filter, we can obtain the covariance function
of the states conditioned on all the observations and
known matrices such as $F(n)$, which in our case directly
leads to $\gamma_{ij}$. The Kalman filtering approach has an
advantage in that only $p \times p$ matrix operations are required while
equation (5-23) requires $N \times N$ matrix operations.
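To make the difference between equations (5-16) and (5-23) concrete, the closed form can be evaluated directly for a very small $N$. The following is a sketch only, assuming $\mathbf{s}_I = \mathbf{0}$ (so that $\mathbf{m}_s = \mathbf{0}$), samples ordered $s(0), \ldots, s(N-1)$, and a matrix $A$ that carries $a_k$ on its $k$th subdiagonal; equation (3-6) is not reproduced here, so that structure is an assumption. The helper names are hypothetical, and the plain Gauss-Jordan inverse is only suitable for tiny matrices.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def posterior_moments(a, y, g2, sd2):
    """Return m, V and V + m m^T as in equations (5-21) and (5-23).

    Assumes s_I = 0, so the prior mean m_s vanishes.
    """
    n, p = len(y), len(a)
    # (I - A): unit diagonal, -a_k on the k-th subdiagonal.
    ia = [[1.0 if i == j else (-a[i - j - 1] if 0 < i - j <= p else 0.0)
           for j in range(n)] for i in range(n)]
    # R_s^{-1} + I / sd2, using R_s^{-1} = (I - A)^T (I - A) / g2.
    precision = [[sum(ia[k][i] * ia[k][j] for k in range(n)) / g2
                  + (1.0 / sd2 if i == j else 0.0)
                  for j in range(n)] for i in range(n)]
    V = invert(precision)
    m = [sum(V[i][j] * y[j] / sd2 for j in range(n)) for i in range(n)]
    second = [[V[i][j] + m[i] * m[j] for j in range(n)] for i in range(n)]
    return m, V, second


m, V, second = posterior_moments([0.9], [1.0, 0.5, -0.2, 0.3], g2=1.0, sd2=0.5)
```

The LMAP product estimate of equation (5-16) corresponds to `m[i] * m[j]`, while the RLMAP estimate of equation (5-20) is `second[i][j]`; the two differ exactly by the conditional covariance `V[i][j]`.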




V.4.3 RLMAP Estimation Procedure

Summarizing the steps involved in the linear implementation,

Step 1: Begin with $\hat{\mathbf{a}}_i$, the $i$th estimate of $\mathbf{a}$.

Step 2: A. Obtain $\hat{\mathbf{s}}_{0,i+1}$, the $(i+1)$st estimate of
$\mathbf{s}_0$, by solving equation (5-12), from
equation (5-15) or from the smoothing
form of a Kalman filter.

B. Obtain $\beta_{ij}$ from equation (5-18) and $\gamma_{ij}$
from $[\gamma_{ij}] = [\beta_{ij}]^{-1}$, or obtain $\gamma_{ij}$
from equation (5-21c), or from the
smoothing form of a Kalman filter.

C. Estimate $s(i) \cdot s(j)$ from equation (5-20)
with the results obtained in steps
A and B above, or estimate $\mathbf{s}_0 \cdot \mathbf{s}_0^T$ from
equation (5-23).

Step 3: Obtain $\hat{\mathbf{a}}_{i+1}$, the $(i+1)$st estimate of
$\mathbf{a}$, by minimizing equation (4-19) with the estimate of
$s(i) \cdot s(j)$ or $\mathbf{s}_0 \cdot \mathbf{s}_0^T$ obtained in Step 2.

The above steps can be continued for as many
iterations as desired. The initial estimate $\hat{\mathbf{a}}_0$ can be
obtained by simply applying the correlation method of the
linear prediction analysis to $\mathbf{y}_0$. As in the LMAP case,
there are a number of ways of obtaining $g$ and $\mathbf{s}_I$, which
are the assumed known variables in the algorithm. The
possible methods discussed in Section V.3 are equally
applicable to the algorithm discussed in this section.

We'll refer to this algorithm as the "Revised Linearized
MAP" (RLMAP) estimation procedure.

To emphasize the difference between the LMAP and
RLMAP algorithms, a block diagram that represents one
iteration of the two algorithms is shown in Figure 5.1.
The only difference between the two algorithms is an
additional term $V$ in estimating $\mathbf{s}_0 \cdot \mathbf{s}_0^T$ in the
RLMAP algorithm. Compared with the LMAP algorithm discussed
in Section V.3, the RLMAP algorithm is computationally
less tractable. As will be discussed in Chapter VI,
however, when $N$ is assumed to approach $\infty$, the RLMAP
algorithm is only slightly more complex in its computation
than the LMAP algorithm. In the RLMAP algorithm, there
are at least two ways $\mathbf{s}_0$ can be estimated. One way is
to use the $\hat{\mathbf{s}}_0$ obtained in Step 2A. This is equivalent to
estimating $\mathbf{s}_0$ by $E[\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0]$.
Alternatively, $\mathbf{s}_0$ can be
estimated by forming $\hat{\phi}_s(n)$ from the estimate of
$s(i) \cdot s(j)$ and assuming some phase of $\mathbf{s}_0$. The estimated
$\hat{\mathbf{s}}_0$ can be used as enhanced
speech if speech enhancement is desired.


Figure 5.1 One iteration of LMAP and RLMAP algorithms.
$\mathbf{m}$ and $V$ are given by equation (5-21) in the text.

V.5 Extension to Colored Background Noise Case

Our discussions in Sections V.2, V.3 and V.4 are
based on white Gaussian noise as the additive background
noise. In this section, the theoretical results are
extended to the case when the background noise is Gaussian
but colored. When the background noise is colored, all
the discussions in the previous three sections remain
unchanged except that estimating $\mathbf{s}_0 \cdot \mathbf{s}_0^T$ by
$E[\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0] \cdot E[\mathbf{s}_0^T \mid \mathbf{a},\mathbf{y}_0]$
or $E[\mathbf{s}_0 \cdot \mathbf{s}_0^T \mid \mathbf{a},\mathbf{y}_0]$ should
again be shown to be a linear problem.

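From equation (3-6e) with $u(n) = g \cdot w(n)$, equation
(5-14) can be easily generalized as

$$ p(\mathbf{s}(N-1,0) \mid \mathbf{a},\mathbf{y}_0) = N\bigl((R_s^{-1} + R_d^{-1})^{-1}(R_d^{-1}\mathbf{y}_0 + R_s^{-1}\mathbf{m}_s),\ (R_s^{-1} + R_d^{-1})^{-1}\bigr) \qquad (5-24) $$

in which

$$ \mathbf{m}_s = (I-A)^{-1} A_I\,\mathbf{s}_I $$

$$ R_s = g^2 (I-A)^{-1} ((I-A)^{-1})^T $$

$$ R_d = E[\mathbf{d}(N-1,0) \cdot \mathbf{d}^T(N-1,0)] $$

where $R_d$ is obtained from the assumed known statistics of $d(n)$,
and $A$ and $A_I$ are defined in equation (3-6).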

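Equation (5-24) can be used to show that estimating
$\mathbf{s}_0$ by $E[\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0]$ and
$\mathbf{s}_0\mathbf{s}_0^T$ by $E[\mathbf{s}_0\mathbf{s}_0^T \mid \mathbf{a},\mathbf{y}_0]$
are still linear problems since

$$ E[\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0] = (R_s^{-1} + R_d^{-1})^{-1}(R_d^{-1}\mathbf{y}_0 + R_s^{-1}\mathbf{m}_s) \qquad (5-25) $$

and

$$ E[\mathbf{s}_0\mathbf{s}_0^T \mid \mathbf{a},\mathbf{y}_0] = (R_s^{-1} + R_d^{-1})^{-1} + E[\mathbf{s}_0 \mid \mathbf{a},\mathbf{y}_0] \cdot E[\mathbf{s}_0^T \mid \mathbf{a},\mathbf{y}_0] \qquad (5-26) $$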

V.6 Relationship Among Maximization of $p(\mathbf{a} \mid \mathbf{y}_0)$, LMAP,
and RLMAP Algorithms

The LMAP and RLMAP algorithms have been developed
in this chapter by attempting to suboptimally maximize
$p(\mathbf{a} \mid \mathbf{y}_0)$. Some recent theoretical work by Musicus [38]
carried out in parallel with this dissertation shows
that a close relationship exists among the LMAP and RLMAP
algorithms and the problem of maximizing $p(\mathbf{a} \mid \mathbf{y}_0)$. More
specifically, suppose that $g$ is known and $\mathbf{s}_I = \mathbf{0}$.
Representing $p(\mathbf{a} \mid \mathbf{y}_0)$ by $f(\mathbf{a}) \cdot \exp(g(\mathbf{a}))$,
the LMAP and RLMAP algorithms
increase $g(\mathbf{a})$ and $p(\mathbf{a} \mid \mathbf{y}_0)$ respectively at each
iteration unless a converging solution is reached. Therefore if
$g$ is assumed known and $\mathbf{s}_I$ is assumed to be $\mathbf{0}$, then the
RLMAP algorithm is one way to maximize $p(\mathbf{a} \mid \mathbf{y}_0)$. Further
theoretical work related to the above discussions is
currently under way and will be reported by Musicus [38].


CHAPTER VI IMPLEMENTATION: THREE NOISE REDUCTION SYSTEMS

VI.1 Introduction

This chapter discusses the three noise reduction systems that
were implemented and evaluated. The two systems
discussed in Sections VI.2 and VI.3 are derived by
approximating the LMAP and RLMAP algorithms discussed in
Chapter V. Even though the LMAP and RLMAP algorithms
require solving only sets of linear equations or implementing
a Kalman filter, some approximations lead to
computationally simpler systems by making use of an FFT
algorithm. In Section VI.4, a speech enhancement system
discussed in Section II.2.6 is summarized. The primary
purpose of implementing this system is to compare it with
the other two systems discussed in Sections VI.2 and VI.3.
Since the system summarized in Section VI.4 is probably
as good in its performance as any other speech enhancement
system summarized in Chapter II, such a comparison
can provide an indication of the performance of the two
systems derived from the theoretical framework of this
dissertation relative to other speech enhancement systems
previously proposed. The results of the evaluation of
the three systems will be presented in Chapters VII and
VIII.


VI.2 System A

For each iteration of the LMAP algorithm discussed
in Chapter V, it is in general necessary to solve a set
of $p$ linear equations to estimate $\mathbf{a}$ from $\hat{\mathbf{s}}_0$
and $N$ linear equations to estimate $\mathbf{s}_0$ from $\mathbf{a}$ and
$\mathbf{y}_0$. Since $N$ is in general on the order of several hundred
for a typical application in speech, solving a set of $N$ linear equations
simultaneously can be computationally tedious. Thus we
develop a procedure that approximates solving the set
of $N$ linear equations.

From equations (5-11b) and (5-12),

$$ \frac{g^2}{\sigma_d^2}\,y(i) = s(i) - \sum_{k=1}^{p} a_k\,s(i-k) - \sum_{k=1}^{p} a_k\,s(i+k) + \sum_{k=1}^{p}\sum_{l=1}^{p} a_k a_l\,s(i+k-l) + \frac{g^2}{\sigma_d^2}\,s(i) \quad \text{for } 0 \le i \le N-p-1 \qquad (6-1a) $$

$$ \frac{g^2}{\sigma_d^2}\,y(i) = s(i) - \sum_{k=1}^{p} a_k\,s(i-k) - \sum_{k=1}^{N-1-i} a_k\,s(i+k) + \sum_{k=1}^{N-1-i}\sum_{l=1}^{p} a_k a_l\,s(i+k-l) + \frac{g^2}{\sigma_d^2}\,s(i) \quad \text{with } \mathbf{s}(N+p-3,N) = \mathbf{0}, \text{ for } N-p \le i \le N-2 \qquad (6-1b) $$

and

$$ \frac{g^2}{\sigma_d^2}\,y(i) = s(i) - \sum_{k=1}^{p} a_k\,s(i-k) + \frac{g^2}{\sigma_d^2}\,s(i) \quad \text{for } i = N-1 \qquad (6-1c) $$

Solving equation (6-1) for $\hat{\mathbf{s}}_0$ in general requires
solving $N$ simultaneous linear equations. However, if we
assume that $\mathbf{s}(p-1,0)$ is also given as well as $\mathbf{s}_I$, then
the $N$ equations do not have to be solved simultaneously.
More specifically, rearranging equation (6-1a),


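$$ a_p\,s(i+p) = s(i) - \sum_{k=1}^{p} a_k\,s(i-k) - \sum_{k=1}^{p-1} a_k\,s(i+k) + \sum_{k=1}^{p}\sum_{l=1}^{p} a_k a_l\,s(i+k-l) + \frac{g^2}{\sigma_d^2}\,s(i) - \frac{g^2}{\sigma_d^2}\,y(i) \quad \text{for } 0 \le i \le N-p-1 \qquad (6-2) $$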


$s(i+p)$ in equation (6-2) for $0 \le i \le N-p-1$ can be solved
individually if $\mathbf{s}(p-1,0)$ is given, since the right hand
side of equation (6-2) involves terms of $s(n)$ for $n < i+p$.
$\mathbf{s}(p-1,0)$, of course, is not given, but we could assume
$\mathbf{s}(p-1,0) = \mathbf{y}(p-1,0)$. For $N$ sufficiently large relative
to $p$, we would in general expect that the effect of
a specific assumption of $\mathbf{s}(p-1,0)$ is rather small.

In the above, we have developed a procedure which
does not require solving a set of $N$ linear equations
simultaneously. However, solving for $\hat{\mathbf{s}}_0$ from equation
(6-2) still requires on the order of $N \cdot p^2$ multiplications.
Furthermore, once $\hat{\mathbf{s}}_0$ is estimated, the correlation
function has to be formed from $\hat{\mathbf{s}}_0$. An alternative approach
which is computationally simpler and leads to a system with
a simple interpretation is to consider the problem in
the frequency domain. More specifically, z transforming
equation (6-1a) with the assumption that the difference
equation holds for all $i$ (i.e., $N = \infty$),

$$ \frac{S(\omega)}{Y(\omega)} = \frac{P_s(\omega)}{P_s(\omega) + \sigma_d^2} \qquad (6-4a) $$

where

$$ P_s(\omega) = \frac{g^2}{1 - 2\sum_{k=1}^{p} a_k \cos k\omega + \sum_{k=1}^{p}\sum_{l=1}^{p} a_k a_l \cos (k-l)\omega} \qquad (6-4b) $$

Equation (6-4) is a non-causal Wiener filter. This result
is quite reasonable since it is well known that when
$y(n) = s(n) + d(n)$, where $s(n)$ is uncorrelated with $d(n)$ and
the power spectral densities of $s(n)$ and $d(n)$ are known,
the MMSE estimate of $s(n)$ from $y(n)$ can be obtained by
a Wiener filter. For this reason, then, for a more general
case when the background noise is colored, the procedure
for obtaining $s(n)$ by equation (5-25) is equivalent to
estimating $s(n)$ by filtering $y(n)$ with a linear, time
invariant filter with the frequency response given by

$$ H(\omega) = \frac{P_s(\omega)}{P_s(\omega) + P_d(\omega)} \qquad (6-5) $$

where $P_d(\omega)$ represents the power spectral density of the
background noise and $P_s(\omega)$ represents the power spectral
density of speech given by equation (6-4b).
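As an illustration, the frequency response of equations (6-4b) and (6-5) can be evaluated at a single frequency as follows. This is a minimal sketch; the function names are hypothetical, the noise spectrum is passed in as a callable so that the white case is simply a constant $\sigma_d^2$, and the all-pole model is assumed to have no poles on the unit circle so that the denominator of equation (6-4b) does not vanish.

```python
import math


def speech_psd(a, g2, omega):
    """All-pole speech spectrum P_s(omega) of equation (6-4b)."""
    p = len(a)
    single = sum(a[k] * math.cos((k + 1) * omega) for k in range(p))
    double = sum(a[k] * a[l] * math.cos((k - l) * omega)
                 for k in range(p) for l in range(p))
    return g2 / (1.0 - 2.0 * single + double)


def wiener_response(a, g2, noise_psd, omega):
    """H(omega) = P_s / (P_s + P_d) as in equation (6-5).

    noise_psd is a callable returning P_d(omega).
    """
    ps = speech_psd(a, g2, omega)
    return ps / (ps + noise_psd(omega))


# White background noise with variance 0.5 (hypothetical values).
h_low = wiener_response([1.2, -0.5], 1.0, lambda w: 0.5, 0.3)
h_high = wiener_response([1.2, -0.5], 1.0, lambda w: 0.5, 3.0)
```

The response is real and lies between 0 and 1: it stays close to 1 at frequencies where the speech spectrum dominates the noise and falls toward 0 where the noise dominates.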

Theoretically, the non-causal Wiener filter requires
an infinite amount of data. In practice, we have only $N$
points of data that can be modelled as $y_w(n) = y(n) \cdot w_s(n)$,
where $w_s(n)$ represents a sufficiently smooth analysis
window over the effective length of $h(n)$. For a sufficiently
large $N$ and small effective length of $h(n)$ relative
to $N$,

$$ (y(n) \cdot w_s(n)) * h(n) \approx (y(n) * h(n)) \cdot w_s(n) = s(n) \cdot w_s(n) \qquad (6-6a) $$

and therefore

$$ y_w(n) * h(n) \approx s_w(n) \qquad (6-6b) $$

Based on equation (6-6b), $s_w(n)$ is estimated by

$$ \hat{s}_w(n) = y_w(n) * h(n) \qquad (6-6c) $$

We expect that the approximation given by equation (6-6b)
is not good for $n$ close to 0 or $N-1$ but is adequate for
$0 \ll n \ll N-1$. For a sufficiently large $N$, it is expected
that the poor approximation at the edges of the window
does not have a large effect. From equation (6-6c),

$$ \hat{S}_w(\omega) = Y_w(\omega) \cdot H(\omega) \qquad (6-7a) $$

and

$$ |\hat{S}_w(\omega)|^2 = |Y_w(\omega)|^2 \cdot |H(\omega)|^2 \qquad (6-7b) $$

Now if equation (4-22) is used rather than equation (4-19) in step 2 of the LMAP algorithm, the function that is directly used in minimizing $\epsilon_p$ in equation (4-22) can be expressed as

$$\hat{\phi}_s(n) = \sum_{i=-\infty}^{\infty} \hat{s}_w(i)\cdot\hat{s}_w(i-n)$$

Then $\hat{s}_w(n)$ and $\hat{\phi}_s(n)$ can be obtained by inverse Fourier transforming $\hat{S}_w(\omega)$ and $|\hat{S}_w(\omega)|^2$, i.e.,

$$\hat{s}_w(n) = F^{-1}[\hat{S}_w(\omega)] = F^{-1}[Y_w(\omega)\cdot H(\omega)] \tag{6-8a}$$

$$\hat{\phi}_s(n) = F^{-1}[|\hat{S}_w(\omega)|^2] = F^{-1}[|Y_w(\omega)|^2\cdot|H(\omega)|^2] \tag{6-8b}$$

Denoting the M point Discrete Fourier Transform (DFT) of a sequence x(n) by X(k),

$$X(k) = \sum_{n=0}^{M-1} x(n)\cdot e^{-j\frac{2\pi}{M}kn} = X(\omega)\Big|_{\omega=\frac{2\pi}{M}k}$$

Since x(n) is related [39] to the Inverse Discrete Fourier Transform (IDFT) of X(k) by

$$\sum_{k=-\infty}^{\infty} x(n+k\cdot M) = \mathrm{IDFT}[X(k)],$$

equation (6-8) leads to

$$\sum_{k=-\infty}^{\infty} \hat{s}_w(n+k\cdot M) = \mathrm{IDFT}[\hat{S}_w(k)] = \mathrm{IDFT}[Y_w(k)\cdot H(k)] \tag{6-9a}$$

$$\sum_{k=-\infty}^{\infty} \hat{\phi}_s(n+k\cdot M) = \mathrm{IDFT}[\hat{\Phi}_s(k)] = \mathrm{IDFT}[|Y_w(k)|^2\cdot|H(k)|^2] \tag{6-9b}$$

For a finite effective length of $\hat{s}_w(n)$ and $\hat{\phi}_s(n)$ and for a sufficiently large M,

$$\sum_{k=-\infty}^{\infty} \hat{s}_w(n+k\cdot M) = \hat{s}_w(n)$$

and

$$\sum_{k=-\infty}^{\infty} \hat{\phi}_s(n+k\cdot M) = \hat{\phi}_s(n)$$

With this assumption, we estimate $\hat{s}_w(n)$ and $\hat{\phi}_s(n)$ by

$$\hat{s}_w(n) = \mathrm{IDFT}(Y_w(k)\cdot H(k)) \tag{6-10a}$$

$$\hat{\phi}_s(n) = \mathrm{IDFT}(|Y_w(k)|^2\cdot|H(k)|^2) \tag{6-10b}$$

$\hat{s}_w(n)$ in equation (6-10a) can be used as enhanced speech. $\hat{\phi}_s(n)$ in equation (6-10b) can be used to estimate $\underline{a}$ by minimizing $\epsilon_p$ in equation (4-22).

approach ∞, thus reducing the minimization of equation (4-22) to the correlation method of the linear prediction analysis.
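For reference, the normal equations of the correlation method are commonly solved with the Levinson-Durbin recursion. The sketch below is a generic version of that recursion, not a procedure quoted from the source: the function name is hypothetical, and the sign convention (predictor s(n) ≈ Σ a_k s(n-k), matching the factor 1 - Σ a_k e^{-jkω} in the all pole spectrum) is an assumption.

```python
def levinson_durbin(phi, p):
    """Solve sum_k a_k * phi(|i-k|) = phi(i), i = 1..p, for a_1..a_p.

    phi : correlation values phi(0), ..., phi(p)
    Returns (a, err), where err is the final prediction error energy.
    """
    a = [0.0] * p
    err = phi[0]
    for i in range(1, p + 1):
        # Reflection coefficient for order i.
        acc = phi[i] - sum(a[j] * phi[i - 1 - j] for j in range(i - 1))
        k = acc / err
        # Update the lower-order coefficients, then append the new one.
        new_a = a[:]
        new_a[i - 1] = k
        for j in range(i - 1):
            new_a[j] = a[j] - k * a[i - 2 - j]
        a = new_a
        err *= (1.0 - k * k)
    return a, err

# Correlation of a first-order process with coefficient 0.8 (hypothetical data).
phi_example = [0.8 ** n for n in range(3)]
a_est, err_est = levinson_durbin(phi_example, 2)
```

In the systems of this chapter, the first p+1 points of the estimated correlation would play the role of `phi`.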


The  above  steps  complete  one  iteration  and  we'll 
refer  to  the  above  system  as  System  A.  It  is  noted  that 
System  A  can  be  used  to  estimate  sw(n)  as  well  as  a, 
and  that  System  A  does  not  require  an  estimate  of  s^ . 
Further, it is noted that the phase of S(ω) estimated in System A is the same as the phase of Y(ω). This is because the frequency response of a non-causal Wiener filter H(ω) is real and positive and thus zero phase. In various speech enhancement systems discussed in Chapter II, we have seen that the phase of S(ω) used is the same as the phase of Y(ω).
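The filtering part of one System A iteration, equations (6-10a) and (6-10b), can be sketched as follows. This is an illustrative sketch only: a direct O(M²) DFT is used in place of an FFT to keep the example self-contained, and the segment values and the response H are hypothetical.

```python
import cmath
import math

def dft(x, M):
    """M-point DFT of x, zero-padded to length M (direct sum, illustration only)."""
    x = list(x) + [0.0] * (M - len(x))
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / M) for n in range(M))
            for k in range(M)]

def idft(X):
    """Inverse DFT matching dft() above."""
    M = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / M) for k in range(M)) / M
            for n in range(M)]

def system_a_filter(y_w, H, p):
    """Return the enhanced segment (6-10a) and the first p+1 correlation points (6-10b)."""
    M = len(H)
    Y = dft(y_w, M)
    s_hat = [v.real for v in idft([Y[k] * H[k] for k in range(M)])]
    phi_hat = [v.real for v in idft([abs(Y[k]) ** 2 * H[k] ** 2 for k in range(M)])]
    return s_hat, phi_hat[:p + 1]

y_w = [1.0, 0.5, -0.25, 0.0]                    # hypothetical windowed noisy segment
H = [0.9, 0.7, 0.5, 0.3, 0.2, 0.3, 0.5, 0.7]    # hypothetical real, symmetric response, M = 8
s_hat, phi_hat = system_a_filter(y_w, H, p=2)
```

Since H is real and symmetric here, the filtered segment is real and keeps the phase of Y at every frequency, in line with the zero-phase remark above.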


VI.3 System B

In Section VI.2, System A was developed based on the LMAP estimation procedure discussed in Section V.3. In this section we develop a system that is based on the RLMAP estimation procedure discussed in Section V.4.

From equation (5-20), the difference between the LMAP and RLMAP estimation procedures is the additional term $\gamma_{ij}$ in estimating s(i)·s(j). The system developed in this section is a modification of System A that incorporates the term $\gamma_{ij}$.

13 


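From equations (6-4b) and (6-13),

$$A(\omega) = \frac{1}{P_s(\omega)} \tag{6-14a}$$

and

$$P_s(\omega) = \frac{g^2}{\left|1 - \sum_{k=1}^{p} a_k\cdot e^{-jk\omega}\right|^2} \tag{6-14b}$$

From equations (6-12) and (6-14), $B(\omega) = A(\omega) + \Theta(\omega)$ and therefore,

$$B(\omega) = \frac{1}{P_s(\omega)} + \frac{1}{\sigma_d^2} \tag{6-15}$$

Since $\gamma_{ij} = [B^{-1}]_{ij}$ and $B_{ij}$ depends on the time difference i-j, representing $\gamma_{ij}$ by $\gamma(n) = \gamma(i-j) = \gamma_{ij}$, $\gamma(n) * \beta(n) \approx \delta(n)$ and therefore,

$$\Gamma(\omega)\cdot B(\omega) = 1 \tag{6-16}$$

From equations (6-15) and (6-16),

$$\Gamma(\omega) = \frac{P_s(\omega)\cdot\sigma_d^2}{P_s(\omega) + \sigma_d^2} \tag{6-17}$$

and $P_s(\omega)$ is given by equation (6-14b).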
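The step from equations (6-15) and (6-16) to equation (6-17) is a single inversion. Written out as a check (this worked form is added here for clarity and is not quoted from the source), it also shows that for white noise Γ(ω) equals the noise variance times the Wiener response of equation (6-5) with P_d(ω) = σ_d²:

```latex
\Gamma(\omega) = \frac{1}{B(\omega)}
  = \frac{1}{\dfrac{1}{P_s(\omega)} + \dfrac{1}{\sigma_d^2}}
  = \frac{P_s(\omega)\,\sigma_d^2}{P_s(\omega) + \sigma_d^2}
  = \sigma_d^2 \cdot \frac{P_s(\omega)}{P_s(\omega) + \sigma_d^2}
```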

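Since $[\Theta]^{-1}$ represents the covariance matrix of the background noise, for M approaching ∞, $\Theta(\omega) \approx 1/P_d(\omega)$

$\sum_{i=-\infty}^{\infty} E[s_w(i)\,|\,\underline{a},\underline{y}]\cdot E[s_w(i-n)\,|\,\underline{a},\underline{y}]$ by $\hat{\phi}_s(n)$ in equation (6-8b), and $\sum_{i=-\infty}^{\infty} w_s(i)\cdot w_s(i-n)$ by $\sum_{i=-\infty}^{\infty} w_s^2(i)$ for $0 \le n \le p$,

$$\hat{\phi}_s(n) = F^{-1}\left[\,|Y_w(\omega)|^2\cdot\left(\frac{P_s(\omega)}{P_s(\omega)+P_d(\omega)}\right)^2 + \sum_{i=-\infty}^{\infty} w_s^2(i)\cdot\frac{P_s(\omega)\cdot P_d(\omega)}{P_s(\omega)+P_d(\omega)}\right] \quad \text{for } 0 \le n \le p \tag{6-21}$$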

In an analogous manner as equation (6-10) was obtained from equation (6-8), $\hat{\phi}_s(n)$ is estimated from equation (6-21) by

$$\hat{\phi}_s(n) = \mathrm{IDFT}\left[\,|Y_w(k)|^2\cdot\left(\frac{P_s(k)}{P_s(k)+P_d(k)}\right)^2 + \sum_{i=-\infty}^{\infty} w_s^2(i)\cdot\frac{P_s(k)\cdot P_d(k)}{P_s(k)+P_d(k)}\right] \quad \text{for } 0 \le n \le p \tag{6-22}$$

and $P_s(\omega)$ is given by equation (6-14b). Equation (6-22) can be used in minimizing $\epsilon_p$ in equation (4-22).

P 

Now we summarize the specific algorithm that has been implemented and evaluated.

Step 0: Obtain $\hat{\underline{a}}_0$ by the correlation method of the linear prediction analysis assuming $s_w(n) = y_w(n)$.

Step 1: Begin from $\hat{\underline{a}}_i$, the ith estimate of $\underline{a}$.

Step 2: A. Estimate g by an energy measurement;


d  (n 


=  I  it  (n)  “  l  (n)  *° 

n  w  n  s 

A 

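where $\underline{a}$ corresponds to $\hat{\underline{a}}_i$.

B. Estimate $\hat{\phi}_s(n)$ by

$$\mathrm{IDFT}\left[\,|Y_w(k)|^2\cdot|H(k)|^2 + \sum_{i=-\infty}^{\infty} w_s^2(i)\cdot\frac{P_s(k)\cdot P_d(k)}{P_s(k)+P_d(k)}\right]$$

where

$$H(\omega) = \frac{P_s(\omega)}{P_s(\omega)+P_d(\omega)}, \qquad P_s(\omega) = \frac{g^2}{\left|1 - \sum_{k=1}^{p} a_k\cdot e^{-jk\omega}\right|^2}$$

and $\underline{a}$ corresponds to $\hat{\underline{a}}_i$. If $\hat{s}_w(n)$ is desired and if it is the last iteration to be performed, estimate $\hat{s}_w(n)$ by $\mathrm{IDFT}[Y_w(k)\cdot H(k)]$.

Step 3: With the first p+1 points of $\hat{\phi}_s(n)$, and with the quantities given by the available a priori knowledge of $\underline{a}$, estimate $\hat{\underline{a}}_{i+1}$ by minimizing $\epsilon_p$ in equation (4-22).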
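The correlation estimate of Step 2B, i.e. equation (6-22), can be sketched on an M-point grid as follows. The function name and the small numerical inputs are hypothetical, and the sketch assumes that the sampled spectra are real and symmetric so that the inverse DFT reduces to a cosine sum.

```python
import math

def system_b_correlation(Y_power, Ps, Pd, window, p):
    """First p+1 points of the correlation estimate in equation (6-22).

    Y_power : |Y_w(k)|^2 for k = 0..M-1
    Ps, Pd  : speech and noise power spectral densities sampled at the same k
    window  : analysis window samples w_s(i)
    """
    M = len(Y_power)
    w_energy = sum(w * w for w in window)        # sum_i w_s^2(i)
    spectrum = []
    for k in range(M):
        H = Ps[k] / (Ps[k] + Pd[k])              # Wiener response
        gamma = Ps[k] * Pd[k] / (Ps[k] + Pd[k])  # additional RLMAP term
        spectrum.append(Y_power[k] * H * H + w_energy * gamma)
    # Real part of the inverse DFT of a real spectrum.
    return [sum(spectrum[k] * math.cos(2.0 * math.pi * k * n / M)
                for k in range(M)) / M
            for n in range(p + 1)]

phi_b = system_b_correlation([4.0, 1.0, 1.0, 1.0], [1.0] * 4, [1.0] * 4,
                             window=[1.0, 1.0], p=1)
```

Setting the noise density to zero removes the additional term and makes H equal to one, so the estimate falls back to the correlation of the noisy segment itself.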


VI.4 System C

This system is based on a speech enhancement method discussed in Section II.2.6. The specific algorithm implemented and evaluated is given below.

Step 1: Estimate $|S_w(\omega)|^2$ by

$$|\hat{S}_w(\omega)|^2 = \begin{cases} |Y_w(\omega)|^2 - k\cdot E[|D_w(\omega)|^2] & \text{for } |Y_w(\omega)|^2 > k\cdot E[|D_w(\omega)|^2] \\ 0 & \text{otherwise} \end{cases}$$

for some constant k. $S_w(\omega)$, $Y_w(\omega)$ and $D_w(\omega)$ represent the Fourier Transform of the windowed segment of speech, noisy speech, and noise respectively.

Step 2: Obtain $\hat{\phi}_s(n)$ by $\mathrm{IDFT}[|\hat{S}_w(k)|^2]$. If $\hat{s}_w(n)$ is desired, then $\hat{s}_w(n)$ is estimated from $|\hat{S}_w(\omega)|$ in Step 1 and the phase of $Y_w(\omega)$.

Step 3: Estimate $\underline{a}$ by minimizing $\epsilon_p$ in equation (4-22) with the first p+1 points of $\hat{\phi}_s(n)$ obtained in Step 2.

We'll refer to the above system as System C. Compared with System A or System B, System C is computationally simpler. It is also noted that when k=0 in System C and no a priori information is assumed, it corresponds to estimating $\underline{a}$ by the correlation method of the linear prediction analysis with the assumption of $s_w(n)=y_w(n)$.
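Steps 1 and 2 of System C can be sketched in a few lines. The function names and the sample values are hypothetical, and the inverse DFT is written as a cosine sum on the assumption that the sampled power spectrum is real and symmetric.

```python
import math

def system_c_power_spectrum(Y_power, D_power_mean, k):
    """Step 1: subtract k times the expected noise power, clamping at zero."""
    return [yp - k * dp if yp > k * dp else 0.0
            for yp, dp in zip(Y_power, D_power_mean)]

def correlation_from_power(S_power, p):
    """Step 2: first p+1 points of the inverse DFT of a real power spectrum."""
    M = len(S_power)
    return [sum(S_power[j] * math.cos(2.0 * math.pi * j * n / M)
                for j in range(M)) / M
            for n in range(p + 1)]

Y_power = [4.0, 1.0, 3.0, 1.0]    # hypothetical |Y_w(k)|^2, M = 4
D_power = [1.0, 1.0, 1.0, 1.0]    # hypothetical E[|D_w(k)|^2] for white noise
S_power = system_c_power_spectrum(Y_power, D_power, k=2)
phi_c = correlation_from_power(S_power, p=1)
```

With k=0 the first function returns the noisy power spectrum unchanged, which is the case described above as the ordinary correlation method applied to the noisy data.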

CHAPTER VII EXAMPLES AND ILLUSTRATIONS
VII.1 Introduction

The three systems developed in Chapter VI have been applied to both synthetic and real speech data at various S/N ratios and in this chapter a few examples are illustrated. In Section VII.2, examples in which the systems are applied to synthetic data are illustrated. In Section VII.3, examples in which the systems are applied to real speech data are illustrated. In all the examples considered in this chapter, noisy data are generated by adding zero mean white Gaussian background noise and the S/N in dB is defined as $10\cdot\log\left(\sum_n s^2(n) / \sum_n d^2(n)\right)$ where the summation is over the length of the analysis segment. In all the figures in which a time waveform is displayed, the duration is 25.6 msec. In all the figures in which the log magnitude spectrum is displayed, the range is approximately 50 dB and the angular frequency is between 0 and π, which corresponds to the analog frequency between 0 and 5 kHz at a 10 kHz sampling rate.
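The S/N definition above can be illustrated with a short sketch. It assumes that the logarithm is base 10 and that the noise is scaled so that each analysis segment meets the target S/N exactly; the source does not state how the noise level was set, and the function names are hypothetical.

```python
import math
import random

def snr_db(s, d):
    """S/N in dB over one segment: 10*log10(sum s^2(n) / sum d^2(n))."""
    return 10.0 * math.log10(sum(v * v for v in s) / sum(v * v for v in d))

def add_white_noise(s, target_snr_db, rng):
    """Add zero mean white Gaussian noise scaled to the requested segment S/N."""
    d = [rng.gauss(0.0, 1.0) for _ in s]
    signal_energy = sum(v * v for v in s)
    noise_energy = sum(v * v for v in d)
    scale = math.sqrt(signal_energy / (noise_energy * 10.0 ** (target_snr_db / 10.0)))
    d = [scale * v for v in d]
    y = [sv + dv for sv, dv in zip(s, d)]
    return y, d

rng = random.Random(0)                               # fixed seed for repeatability
s = [math.sin(0.1 * n) for n in range(256)]          # hypothetical 256-point segment
y, d = add_white_noise(s, 10.0, rng)
```

At a 10 kHz sampling rate, the 256-point segment used here spans the 25.6 msec duration quoted for the waveform figures.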

VII.2 Application to Synthetic Data

The synthetic data used in the examples are based on a 10 kHz sampling rate and are generated by exciting a tenth order all pole filter whose coefficients were derived from segments of real speech data. The excitation was chosen in one set of examples to be white Gaussian noise and in the other set of examples to be a periodic
impulse  train.  As  we  discussed  in  Chapter  III,  all  the 
theoretical  results  in  Chapters  IV,  V  and  VI  were  derived 
assuming  a  stochastic  excitation.  For  speech  without 
background  noise,  systems  derived  from  this  point  of  view 
have  empirically  been  shown  to  perform  well  even  when  the 
excitation  is  a  periodic  impulse  train  and  it  will  be 
seen  in  this  chapter  and  Chapter  VIII  that  this  statement 
generally  applies  to  the  three  systems  under  consideration. 

In  the  examples  considered  in  this  section,  the  analysis 
is  based  on  256  synthetic  data  points,  the  order  of  the 
all  pole  system  is  assumed  to  be  10,  and  the  S/N  ratios 
considered  are  20  dB,  10  dB  and  0  dB. 
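The generation of such synthetic data can be sketched as follows. The recursion uses zero initial conditions and a low-order hypothetical coefficient vector for brevity (the examples in this chapter use tenth order coefficients derived from real speech, which are not reproduced here); the 150 Hz pulse rate is the value quoted for the pulse train excitation in this section, and the function names are assumptions.

```python
import random

def synthesize(a, excitation):
    """All pole synthesis s(n) = sum_k a_k*s(n-k) + u(n) with zero initial conditions."""
    s = []
    for n, u in enumerate(excitation):
        acc = u
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * s[n - k]
        s.append(acc)
    return s

def impulse_train(N, fs=10000.0, f0=150.0):
    """Periodic unit impulses with period round(fs/f0) samples."""
    period = round(fs / f0)
    return [1.0 if n % period == 0 else 0.0 for n in range(N)]

def white_gaussian(N, rng):
    """Zero mean, unit variance white Gaussian excitation."""
    return [rng.gauss(0.0, 1.0) for _ in range(N)]

a_demo = [1.2, -0.6]                                   # hypothetical stable coefficients
voiced = synthesize(a_demo, impulse_train(256))
unvoiced = synthesize(a_demo, white_gaussian(256, random.Random(0)))
```

At 10 kHz and 150 Hz the pulse period rounds to 67 samples, so a 256-point segment contains four pulses.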

In Sections VII.2.1, VII.2.2 and VII.2.3, the performance of the three systems is discussed and illustrated individually based on one specific synthetic data segment and then later a few more examples are illustrated. The synthetic data used in Sections VII.2.1, VII.2.2 and VII.2.3 are shown in Figures 7.1 and 7.2. Figure 7.1(a) shows the synthetic data when the excitation is random noise. Figure 7.1(b) shows the log magnitude spectrum of the data in Figure 7.1(a) and a tenth order all pole fit to the spectrum by the correlation method of the linear prediction analysis. Figure 7.1(c) shows the synthetic data generated by the same all pole coefficients as in Figure 7.1(a) but the excitation is now a train of pulses whose fundamental frequency is 150 Hz, typical of adult male speech. Figure 7.1(d) illustrates a tenth order all pole fit to the spectrum by the correlation method of the linear prediction analysis. Since the data used are synthetic, the all pole coefficients from which the synthetic data were generated are known. Figure 7.2(a) shows the log magnitude spectrum of the synthetic data in Figure 7.1(a) and the transfer function that corresponds to the known all pole coefficients. Figure 7.2(b) shows the two transfer functions that correspond to the known all pole coefficients and the all pole coefficients estimated from the synthetic data by the correlation method of the linear prediction analysis. Figures 7.2(c) and 7.2(d) are equivalent to Figures 7.2(a) and 7.2(b) except that the excitation is a train of pulses.

Figure 7.1
(a) Synthetic data segment with random noise excitation;
(b) Log magnitude of the spectrum of the synthetic data in (a) and an all pole fit to the spectrum by the correlation method of the linear prediction analysis;
(c) Same as (a) with a pulse train excitation;
(d) Same as (b) with a pulse train excitation

Figure 7.2
(a) Log magnitude spectrum of the synthetic data in Figure 7.1(a) and the transfer function that corresponds to the known all pole coefficients;
(b) Comparison of the transfer functions that correspond to the known all pole coefficients in (a) and the estimated all pole coefficients in Figure 7.1(b);
(c) Same as (a) with a pulse train excitation;
(d) Same as (b) with a pulse train excitation

VII.2.1 Application of System A to Synthetic Data

Figure 7.3 shows the results of the analysis based on System A as a function of the number of iterations when the S/N ratio is 20 dB and the excitation is random noise. More specifically, Figure 7.3(a) shows the all pole fit to the noisy synthetic data by the correlation method of the linear prediction analysis with the assumption that $s_w(n)=y_w(n)$, i.e. the zeroth iteration. Figures 7.3(b), (c) and (d) represent the transfer functions obtained by applying System A to the noisy synthetic data after one, two and three iterations, respectively. In each of the four figures ((a), (b), (c) and (d)), the true log magnitude spectrum corresponding to the excitation of random noise is also shown to facilitate the comparisons. Figure 7.4 is the same as Figure 7.3 except that the excitation is a train of pulses. Figures 7.5 and 7.6 are the same as Figures 7.3 and 7.4 except that the S/N ratio is 10 dB. Figures 7.7 and 7.8 are the same as Figures 7.3 and 7.4 except that the S/N ratio is 0 dB. In all of Figures 7.3 through 7.8, the analysis is based on the assumption that no a priori information of the coefficient vector is available. From the figures, it can be observed that for the three S/N ratios considered a good fit to the true log magnitude spectrum can be obtained after two iterations of System A. It is also observed that the performance of the system when applied to the synthetic data generated by an excitation of a train of pulses is similar to the case of the random noise excitation.

Figure 7.3 Comparison of System A
(a) Log magnitude spectrum of the synthetic data in Figure 7.1(a) and an all pole fit to the noisy data spectrum after the zeroth iteration of System A at S/N = 20 dB;
(b) Same as (a) after the first iteration of System A;
(c) Same as (a) after the second iteration of System A;
(d) Same as (a) after the third iteration of System A

Figure 7.4 Same as Figure 7.3 with the synthetic data of Figure 7.1(c)

From the theoretical point of view, it is expected that a converging solution after many iterations is more desirable. In general, however, it has been observed that the converging solution of System A generates the transfer function for which the bandwidths of the poles are smaller than those associated with real speech. Such a phenomenon can be observed by the general trend of the estimated transfer functions shown in Figures 7.3 through 7.8 as the number of iterations increases. Thus, in the actual implementation of System A, it seems desirable to limit the number of iterations to two.
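The link between pole position and bandwidth behind this observation can be made concrete with the standard approximation that a pole of radius r at sampling rate fs has a bandwidth of about -(fs/π)·ln(r) Hz. The sketch below applies that general formula; it is not taken from the source, and the function name and pole radii are hypothetical.

```python
import math

def pole_bandwidth_hz(r, fs=10000.0):
    """Approximate bandwidth in Hz of a pole with radius r (0 < r < 1)."""
    return -(fs / math.pi) * math.log(r)

# As a pole moves toward the unit circle its bandwidth shrinks.
wide = pole_bandwidth_hz(0.95)
narrow = pole_bandwidth_hz(0.99)
```

Under this approximation, spectral peaks that sharpen with further iterations correspond to poles moving closer to the unit circle.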

VII.2.2 Application of System B to Synthetic Data

Figure 7.9 shows the results of the analysis based on System B as a function of the number of iterations when the S/N ratio is 20 dB and the excitation is random noise. More specifically, Figure 7.9(a) shows the all pole fit to the noisy synthetic data by the correlation method of the linear prediction analysis with the assumption that $s_w(n)=y_w(n)$, i.e. the zeroth iteration. Figures 7.9(b), (c) and (d) represent the estimated transfer functions obtained by applying System B to the noisy synthetic data after two, five and ten iterations, respectively. In each of the four figures ((a), (b), (c) and (d)), the true log magnitude spectrum corresponding to the excitation of random noise is also shown to facilitate the comparisons. Figure 7.10 is the same as Figure 7.9 except that the excitation is a train of pulses. Figures 7.11 and 7.12 are the same as Figures 7.9 and 7.10 except that the S/N ratio is 10 dB. Figures 7.13 and 7.14 are the same as Figures 7.9 and 7.10 except that the S/N ratio is 0 dB. Again the analysis used in Figures 7.9 through 7.14 is based on the assumption that no a priori knowledge of the coefficients is available.

Figure 7.9 Comparison of System B
(a) Log magnitude spectrum of the synthetic data in Figure 7.1(a) (random noise excitation) and an all pole fit to the noisy data spectrum after the zeroth iteration of System B at S/N = 20 dB;
(b) Same as (a) after the second iteration of System B;
(c) Same as (a) after the fifth iteration of System B;
(d) Same as (a) after the tenth iteration of System B

Figure 7.10 Same as Figure 7.9 with the synthetic data of Figure 7.1(c) (pulse train excitation)

From the figures, it can be observed that for the S/N ratios considered, a good fit to the true spectrum can be obtained after five or more iterations of System B. It can also be observed that the performance of the system is similar in both cases of excitation, i.e. random noise and a train of pulses, even though the system was developed based on the assumption of the random noise excitation.

It is not theoretically known if System B converges to a solution. In all the synthetic and real speech data that have been considered, however, it has been observed that System B appears to converge and the estimate after many iterations in general fits the true log magnitude spectrum better than the estimate obtained after a few iterations. It has also been observed that the results after ten iterations correspond reasonably well to the final estimate.

VII.2.3 Application of System C to Synthetic Data

Figure 7.15 shows the results of the analysis based on System C as a function of the scaling constant "k", a parameter of System C, when the S/N ratio is 20 dB and the excitation is random noise. More specifically, Figure 7.15(a) shows the all pole fit to the log magnitude spectrum of the noisy synthetic data by the correlation method of the linear prediction analysis with the assumption that $s_w(n)=y_w(n)$. Figures 7.15(b), (c) and (d) represent the estimated transfer functions obtained by applying System C to the noisy synthetic data at k=1, 2, and 3 respectively. In each of the four figures ((a), (b), (c), (d)), the true log magnitude spectrum corresponding to the excitation of random noise is shown to facilitate the comparisons. Figure 7.16 is the same as Figure 7.15 except that the excitation is a train of pulses. Figures 7.17 and 7.18 are the same as Figures 7.15 and 7.16 except that the S/N ratio is 10 dB. Figures 7.19 and 7.20 are the same as Figures 7.15 and 7.16 except that the S/N ratio is 0 dB. In all of Figures 7.15 through 7.20, the analysis is based on the assumption that no a priori information of the coefficient vector is available.

Figure 7.15 Comparison of System C
(a) Log magnitude spectrum of the synthetic data in Figure 7.1(a) (random noise excitation) and an all pole fit to the noisy synthetic data with k=0 of System C at S/N = 20 dB;
(b) Same as (a) with k=1 of System C;
(c) Same as (a) with k=2 of System C;
(d) Same as (a) with k=3 of System C

Figure 7.16 Same as Figure 7.15 with the synthetic data of Figure 7.1(c) (pulse train excitation)

From the figures, it can be observed that for the S/N ratios considered a good fit to the true log magnitude spectrum can be obtained when k=2 in System C. It is also observed that the performance of the system is similar in both cases of excitation, i.e. random noise and a train of pulses.

When k equals zero, System C corresponds to the correlation method of the linear prediction analysis that does not account for the presence of background noise. Thus, the estimated transfer functions shown in (a) of Figures 7.15 through 7.20 correspond to the case when k equals zero. From many examples of synthetic data, it has been observed that the performance of System C in terms of the log magnitude spectrum fit is poor when k is greater than 3. It has also been observed that the log magnitude spectrum fit at k=2 is generally better than the fit when k=1, which corresponds to the correlation subtraction method.

In the specific example of the synthetic data that has been considered in Sections VII.2.1, VII.2.2 and VII.2.3, a reasonably good fit to the log magnitude spectrum can be obtained by any of the three systems with a proper choice of the system parameter (i.e. the number of iterations for Systems A and B, and the value of k for System C). However, when the noisy data have no spectral peaks or spectral peaks that are different from the pole locations of the original data, then the application of the three systems can result in estimated transfer functions whose pole frequencies are different from those of the original data. This situation can occur when the overall S/N ratio is sufficiently low, in which case all the pole frequencies can be affected, or when the local¹ S/N ratios near some pole locations are sufficiently low, in which case the local poles can be affected. Figures 7.21 through 7.24 illustrate two such examples. Figures 7.21(a) and (b) show an example of a segment of the synthetic data and its log magnitude spectrum. Figures 7.21(c) and (d) illustrate the noisy synthetic data at the S/N ratio of -20 dB and its log magnitude spectrum. Figure 7.22(a) illustrates the transfer function estimated from the noisy synthetic data in Figure 7.21(c) by the correlation method of the linear prediction analysis. Figures 7.22(b), (c) and (d) show the transfer functions estimated by System A after two iterations, System B after ten iterations and System C with k=2. In each of the four figures of Figure 7.22, the true log magnitude spectrum of Figure 7.21(b) is also illustrated to facilitate the comparisons. From Figure 7.22, it is clear that the transfer function generated by any of the three systems does not fit the true spectrum well. Figures 7.23 and 7.24 are equivalent to Figures 7.21 and 7.22 except that a different synthetic data segment is used at the S/N ratio of 10 dB. From Figure 7.24, it is clear that the lower formants where the local S/N ratio is relatively high are well recovered by the three systems but the performance is poor for the higher formants where the S/N ratio is relatively low.

¹The local S/N ratio between two angular frequencies $\omega_1$ and $\omega_2$ is defined by

$$\text{Local S/N ratio in dB} = 10\cdot\log\frac{\int_{\omega_1}^{\omega_2} |S(\omega)|^2\cdot d\omega}{\int_{\omega_1}^{\omega_2} |D(\omega)|^2\cdot d\omega}$$

Figure 7.21
(a) A synthetic data segment;
(b) Log magnitude spectrum of the synthetic data in (a);
(c) Noisy synthetic data of (a) at S/N = -20 dB;
(d) Log magnitude spectrum of the noisy synthetic data in (c)

Figure 7.22
(a) Log magnitude spectrum of the synthetic data in Figure 7.21(a) and an all pole fit to the noisy synthetic data with k=0 of System C;
(b) Same as (a) with two iterations of System A;
(c) Same as (a) with ten iterations of System B;
(d) Same as (a) with k=2 of System C

Figure 7.23
(a) A synthetic data segment;
(b) Log magnitude spectrum of the synthetic data in (a);
(c) Noisy synthetic data of (a) at S/N = 10 dB;
(d) Log magnitude spectrum of the noisy synthetic data in (c)

Figure 7.24
(a) Log magnitude spectrum of the synthetic data in Figure 7.23(a) and an all pole fit to the noisy synthetic data with k=0 of System C;
(b) Same as (a) with two iterations of System A;
(c) Same as (a) with ten iterations of System B;
(d) Same as (a) with k=2 of System C

At a high S/N ratio, the types of errors discussed above do not occur frequently. As the S/N ratio decreases the errors occur more frequently and eventually a point is reached at which the systems are no longer useful for the analysis of noisy speech. In Chapter VIII, this issue will be discussed in greater detail as the performance of the three systems is evaluated by some objective and subjective tests.
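The local S/N ratio used in this section can be approximated on a DFT grid by replacing the two integrals of its definition with sums over the bins in the band of interest. The sketch below assumes a base 10 logarithm; the function name and the toy spectra are hypothetical.

```python
import math

def local_snr_db(S_power, D_power, k1, k2):
    """Approximate local S/N in dB over DFT bins k1..k2 inclusive.

    S_power, D_power : |S(k)|^2 and |D(k)|^2 sampled on the same frequency grid
    """
    signal = sum(S_power[k1:k2 + 1])
    noise = sum(D_power[k1:k2 + 1])
    return 10.0 * math.log10(signal / noise)

# Toy spectra: strong low band, weak high band, white noise.
S_power = [100.0, 100.0, 1.0, 1.0]
D_power = [1.0, 1.0, 1.0, 1.0]
low_band = local_snr_db(S_power, D_power, 0, 1)
high_band = local_snr_db(S_power, D_power, 2, 3)
```

In this toy case the low band sits 20 dB above the noise while the high band is at 0 dB, which mirrors the situation in which lower formants are recovered and higher formants are not.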

VII.3 Application to Real Speech Data

A number of discussions in Section VII.2 on the performance of the three systems when applied to the synthetic data in general also apply to the real speech data. Therefore, only two examples of real speech data at the S/N ratio of 10 dB will be illustrated, primarily to demonstrate that the performance of the systems when applied to the real speech data is similar to the case of the synthetic data. Again, the real speech data are based on a 10 kHz sampling rate, the order of the all pole model is assumed to be 10, the analysis is based on 256 data points, and no a priori information of the all pole coefficient vector is assumed to be available.

Figures 7.25(a) and (b) show an example of a segment of unvoiced speech and its log magnitude spectrum. Figures 7.25(c) and (d) illustrate the noisy speech data and its log magnitude spectrum. Figure 7.26(a) illustrates the transfer function estimated from the noisy speech data in Figure 7.25(c) by the correlation method of the linear prediction analysis. Figures 7.26(b), (c) and (d) show the transfer functions estimated by System A after two iterations, System B after ten iterations and System C with k=2. In each of the four figures of Figure 7.26, the true log magnitude spectrum of Figure 7.25(b) is also illustrated to facilitate the comparisons. Figures 7.27 and 7.28 are equivalent to Figures 7.25 and 7.26 except that a different real speech segment, which is voiced, is used. In the two examples considered, a good fit to the spectrum can be obtained by the three systems. Again when a sufficiently large amount of background noise is added to speech, the errors discussed in Section VII.2 also occur. This can be observed to some extent for the higher formants in Figure 7.28.

Figure 7.25
(a) A real data segment of unvoiced speech;
(b) Log magnitude spectrum of the real speech data in (a);
(c) Noisy speech data of (a) at S/N = 10 dB;
(d) Log magnitude spectrum of the noisy speech data in (c)

Figure 7.26
(a) Log magnitude spectrum of the real speech data in Figure 7.25(a) and an all pole fit to the noisy speech data with k=0 of System C;
(b) Same as (a) with two iterations of System A;
(c) Same as (a) with ten iterations of System B;
(d) Same as (a) with k=2 of System C

Figure 7.28
(a) Log magnitude spectrum of the real speech data in Figure 7.27(a) and an all pole fit to the noisy speech data with k=0 of System C;
(b) Same as (a) with two iterations of System A;
(c) Same as (a) with ten iterations of System B;
(d) Same as (a) with k=2 of System C

In this chapter, various examples were shown to qualitatively illustrate the performance of the three systems when applied to both synthetic and real speech data. In Chapter VIII, a more detailed and quantitative discussion on the performance of the three systems will be presented based on some objective and subjective tests.

CHAPTER VIII EVALUATION
VIII.1 Introduction

In this chapter, the performance of the three systems developed in Chapter VI is discussed in greater detail and more quantitatively based on some objective and subjective tests. Even though the theoretical results can be applied to colored noise as well as white noise, the background noise considered here is white Gaussian background noise. In Section VIII.2, the results of an objective test are discussed. In the objective test, the synthetic data are generated from the known all pole coefficients and the all pole coefficients estimated by the three systems are compared with the known all pole coefficients under a reasonable criterion. In Section VIII.3, the results of a subjective test to evaluate the three systems as analysis/synthesis systems (potential bandwidth compression systems) of noisy speech are discussed. If the estimated speech parameters are properly coded, then they would correspond to true bandwidth compression systems. In Section VIII.4, the three systems are evaluated as speech enhancement systems. In Section VIII.5, some additional studies are discussed, in which a complete analysis/synthesis system is used as input to a channel vocoder. In Section VIII.6, the main results obtained in Chapter VIII are summarized.
VIII.2 Objective Evaluation

In this section, we discuss the performance of the
three systems developed in Chapter VI based on an objective
criterion. In Section VIII.2.1, the systems and their
parameters that are objectively evaluated are listed.
In Section VIII.2.2, we describe the objective criterion
used for the system evaluation. In Section VIII.2.3,
we discuss how the all pole coefficients are obtained to
generate synthetic data. In Section VIII.2.4, we describe
how the synthetic data generated are used to obtain a
measure that leads to the system evaluation under the
objective criterion discussed in Section VIII.2.2.
In Section VIII.2.5, we discuss the results of the objective
evaluation.

VIII.2.1 Systems Evaluated

All three systems discussed in Chapter VII are
evaluated for three cases per system. System A is evaluated
based on the results obtained after one, two and
three iterations. System B is evaluated based on the
results obtained after two, five and ten iterations.
System C is evaluated for the cases when k=1, 2, and 3.
The above nine cases are compared with each other and
with the case of System C when k=0, which corresponds to
the conventional linear prediction analysis. In all
cases, it is assumed that a priori information on the
all pole coefficient vector is not available.

The systems and their parameters for which the
objective test is performed are summarized in Table 8.1.


VIII.2.2 Objective Criterion

One measurement is made for the objective evaluation.
The measurement made is LMSE, which represents the
square error of the log magnitude spectrum. More specifically,
LMSE is given by
and $\mathbf{a}$ and $\hat{\mathbf{a}}$ represent, respectively, the known all pole
coefficients from which the synthetic data are generated and the
all pole coefficients estimated by one of the ten cases listed in
Table 8.1. In evaluating LMSE in equation (8-1a), the
integral is replaced by a summation by sampling at $\omega = \frac{2\pi}{M} k$,
where M = 512. The M used here is the same M in equation
(8-1a). Thus LMSE is evaluated by
Table 8.1

Systems Evaluated under an Objective Criterion

Cases   System   Parameters         A Priori Information
A-1     A        one iteration      none
A-2     A        two iterations     none
A-3     A        three iterations   none
B-2     B        two iterations     none
B-5     B        five iterations    none
B-10    B        ten iterations     none
C-1     C        k=1                none
C-2     C        k=2                none
C-3     C        k=3                none
C-0     C        k=0                none

The criterion used here for the performance evaluation is
based on studies [7] which indicate that the square
error of the log magnitude spectrum reflects reasonably
well the degradation of the perceptually important aspects
of speech.
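
A minimal sketch of a log magnitude spectrum square error sampled at M = 512 frequencies is given here for illustration. It assumes a dB scale, an average over the M samples, no gain term, and the sign convention A(z) = 1 - sum a_i z^{-i}; these choices and the function names are assumptions, not the definition in equation (8-1).

```python
import numpy as np

def log_magnitude_spectrum(a, M=512):
    """Log magnitude (in dB) of the all pole model 1/A(e^{jw}),
    sampled at w = 2*pi*k/M for k = 0..M-1.

    Assumed convention: A(z) = 1 - sum_i a[i-1] * z^{-i}.
    """
    a = np.asarray(a, dtype=float)
    # Zero-padded FFT evaluates A(e^{jw}) on the M-point frequency grid.
    A = np.fft.fft(np.concatenate(([1.0], -a)), M)
    return -20.0 * np.log10(np.abs(A))

def lmse(a_true, a_est, M=512):
    """Mean squared difference of the two log magnitude spectra
    (the 1/M scaling is an assumption)."""
    diff = log_magnitude_spectrum(a_true, M) - log_magnitude_spectrum(a_est, M)
    return float(np.mean(diff ** 2))
```

Under these assumptions the measure is zero when the estimated coefficients equal the known ones and grows as the two spectral envelopes diverge.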

In addition to the LMSE measure, another measurement,
LCSE, which represents the LPC Coefficient Square Error,
was also made. LCSE is defined as

$$\mathrm{LCSE} = \sum_{i=1}^{p} (a_i - \hat{a}_i)^2 \tag{8-2}$$

where $a_i$ and $\hat{a}_i$ are the elements of $\mathbf{a}$ and $\hat{\mathbf{a}}$, which
represent the known all pole coefficients from
which the synthetic data are generated and the
all pole coefficients estimated by one of the ten cases listed in
Table 8.1. The results based on this measure will not be
used for the system performance evaluation in the context
of speech analysis. However, LCSE is an interesting
quantity in that the all pole coefficients are the parameters
that are directly estimated in the systems developed in
this thesis. The results based on LCSE are summarized in
Appendix 2.

VIII.2.3 Generation of All Pole Coefficients

The following two steps are used to obtain one
hundred sets of all pole coefficients that are used
for generating synthetic data. The first step involves
generating a tenth order all pole function in the form
of

$$H(z) = \frac{1}{\prod_{k=1}^{5} (1 - b_k z^{-1})(1 - b_k^{*} z^{-1})} \tag{8-3}$$

where $b_k$ is chosen randomly from within a circle of
radius 0.98 in the z plane, with equal a priori
probability for each point in the circle. The second
step involves generating synthetic data 256 points
long by exciting H(z) in equation (8-3) with white
Gaussian noise and then estimating the all pole coefficients
from the synthetic data by the correlation method of
linear prediction analysis. In generating the all
pole coefficients, the second step was necessary since
some all pole coefficients generated by the first step
alone were quite large in magnitude (sometimes
greater than 20) and the error measurement LCSE in equation
(8-2) was dominated by the error due to a few such
coefficients. It was found that the second step essentially
forced the magnitudes of all the all pole coefficients
generated to be less than 4 without significantly changing
the locations and bandwidths of the poles generated by
the first step. One hundred sets of tenth order all pole
coefficients were obtained by the above two step procedure
and were used in generating the synthetic data for the
objective evaluation.
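
The two step procedure above can be sketched as follows. This is only an illustration under stated assumptions: the poles are drawn uniformly over the area of the disc, the filter starts from rest, the predictor convention is x(n) ≈ sum a_i x(n-i), and all function names are hypothetical.

```python
import numpy as np

def random_poles(rng, n_pairs=5, radius=0.98):
    """Draw n_pairs poles uniformly over the area of a disc of given radius."""
    r = radius * np.sqrt(rng.uniform(size=n_pairs))  # sqrt gives equal area density
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n_pairs)
    return r * np.exp(1j * theta)

def denominator_from_poles(b):
    """Coefficients [1, c_1, ..., c_2K] of prod (1 - b_k z^-1)(1 - conj(b_k) z^-1)."""
    roots = np.concatenate((b, np.conj(b)))
    return np.real(np.poly(roots))

def all_pole_filter(excitation, denom):
    """y[n] = x[n] - sum_{i>=1} denom[i] * y[n-i], starting from rest."""
    p = len(denom) - 1
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for i in range(1, min(p, n) + 1):
            acc -= denom[i] * y[n - i]
        y[n] = acc
    return y

def lpc_autocorrelation(x, p=10):
    """Correlation (autocorrelation) method: solve the normal equations R a = r."""
    x = np.asarray(x, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])  # a_1..a_p

# Step 1: random tenth order all pole function; Step 2: re-estimate from 256 points.
rng = np.random.default_rng(0)
denom = denominator_from_poles(random_poles(rng))
coeffs = lpc_autocorrelation(all_pole_filter(rng.standard_normal(256), denom))
```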


VIII.2.4 Data Acquisition, Analysis and Results

Based on the one hundred all pole transfer functions
obtained in the manner discussed in Section VIII.2.3,
two hundred sequences were generated: one hundred
sequences by exciting with zero mean white Gaussian noise
and the remaining one hundred sequences by exciting
with a train of pulses with the pulse spacing that
corresponds to a fundamental frequency of 150 Hz.
Then for each of the two hundred sequences, noisy synthetic
data were generated by adding zero mean white Gaussian
background noise at the S/N ratios of -20, 0, 10, 20, and
40 dB. For each sequence of the noisy synthetic data
(one thousand sequences), the ten systems in Table 8.1
were used to estimate the all pole coefficients. They
were then compared with the known all pole coefficients
from which the synthetic data were generated. Thus LMSE
in equation (8-1) and LCSE in equation (8-2) were obtained
for each of the one hundred sets of known all pole
coefficients as a function of the system type (ten
cases in Table 8.1), the type of excitation (random noise
or a pulse train) and the S/N ratio (-20, 0, 10, 20, and
40 dB). For notational convenience, we denote
$\mathrm{LMSE}(\mathbf{a}_n, S_i, E_j, R_k)$ and $\mathrm{LCSE}(\mathbf{a}_n, S_i, E_j, R_k)$ to represent
LMSE and LCSE that correspond to $\mathbf{a}_n$, $S_i$, $E_j$ and $R_k$, where

$\mathbf{a}_n$ represents the nth set of the all pole coefficients
and thus $1 \le n \le 100$,

$S_i$ represents the ith system in Table 8.1 and
thus $1 \le i \le 10$,

$E_j$ represents the excitation type with $E_1$ and $E_2$
corresponding to random noise and a pulse train
respectively,

and $R_k$ represents the kth S/N ratio with $R_1$, $R_2$, $R_3$, $R_4$ and
$R_5$ corresponding to -20, 0, 10, 20, and 40 dB
respectively.


Using this notation, we define $\overline{\mathrm{LMSE}}$ and $\overline{\mathrm{LCSE}}$ by

$$\overline{\mathrm{LMSE}}(S_i, E_j, R_k) = \frac{1}{100} \sum_{n=1}^{100} \mathrm{LMSE}(\mathbf{a}_n, S_i, E_j, R_k) \tag{8-4}$$

$$\overline{\mathrm{LCSE}}(S_i, E_j, R_k) = \frac{1}{100} \sum_{n=1}^{100} \mathrm{LCSE}(\mathbf{a}_n, S_i, E_j, R_k) \tag{8-5}$$

From equations (8-4) and (8-5), $\overline{\mathrm{LMSE}}$ and $\overline{\mathrm{LCSE}}$ represent
the mean LMSE and LCSE averaged over the one hundred sets
of the all pole coefficients obtained in Section VIII.2.3
as a function of the system type, excitation type and S/N
ratio. $\overline{\mathrm{LMSE}}$ obtained in this manner is tabulated in Table
8.2 and figures based on Table 8.2 are illustrated in
Section VIII.2.5, where we discuss the performance of
different systems under the objective criterion. $\overline{\mathrm{LCSE}}$
is tabulated in Appendix 2. Table 8.2 also shows
the normalized LMSE, which is defined by

$$\text{Normalized LMSE}(S_i, E_j, R_k) = \frac{\overline{\mathrm{LMSE}}(S_i, E_j, R_k)}{\overline{\mathrm{LMSE}}(S_{10}, E_j, R_k)} \tag{8-6}$$

Since $S_{10}$ corresponds to the conventional linear prediction
analysis that does not account for the presence of noise,
Normalized LMSE$(S_i, E_j, R_k)$ smaller than 1 indicates the
improvement of System $S_i$ over System $S_{10}$. Normalized
LCSE$(S_i, E_j, R_k)$, defined in an analogous manner as in
equation (8-6), is also tabulated in Appendix 2.
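
The noise addition, averaging and normalization steps of this section can be sketched as below. The S/N ratio is assumed to be the ratio of the sequence energy to the expected noise energy, the error array layout `err[n, i, j, k]` and the function names are hypothetical, and the baseline index 9 stands for the tenth case (C-0) of Table 8.1.

```python
import numpy as np

def add_white_noise(signal, snr_db, rng):
    """Add zero mean white Gaussian noise at the requested S/N ratio in dB
    (assumed definition: mean signal power over noise variance)."""
    signal = np.asarray(signal, dtype=float)
    noise_var = np.mean(signal ** 2) / (10.0 ** (snr_db / 10.0))
    return signal + rng.normal(0.0, np.sqrt(noise_var), size=signal.shape)

def mean_and_normalized(err, baseline_index=9):
    """err[n, i, j, k]: error for coefficient set n, system i, excitation j, S/N k.

    Returns the mean over n and that mean divided by the baseline
    system's mean, i.e. the normalized error.
    """
    mean_err = np.asarray(err, dtype=float).mean(axis=0)  # shape (systems, excitations, ratios)
    normalized = mean_err / mean_err[baseline_index]      # broadcast over the system axis
    return mean_err, normalized
```

A normalized value below 1 for a given system, excitation and S/N ratio then indicates an improvement over the conventional linear prediction analysis.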


VIII.2.5 Discussions

Figure 8.1 shows $\overline{\mathrm{LMSE}}(S_i, E_j, R_k)$ for $1 \le i \le 3$,
which corresponds to System A, $1 \le j \le 2$ and $1 \le k \le 5$.
Figure 8.1(a) illustrates the case of j=1, which corresponds
to random noise excitation, and Figure 8.1(b)
shows the case of j=2, which corresponds to
pulse train excitation. In the figures, $\overline{\mathrm{LMSE}}(S_{10}, E_j, R_k)$
is also shown by a solid line to facilitate the comparison
in terms of improvement over the conventional linear
prediction analysis. From Figure 8.1, the following
points are noted. First, System A is capable of performing
better than the conventional linear prediction analysis
for a wide range of S/N ratios. Second, System A shows
a better performance after two iterations than after one
iteration or three iterations at the S/N ratios above
-10 dB. This result is consistent with our observations in
Chapter VII. Third, System A degrades quickly below
0 dB of S/N ratio and eventually performs worse than the
conventional linear prediction analysis. Therefore, -10
dB seems to be the lowest S/N ratio at which System A
shows some improvement over the conventional linear prediction
analysis. Fourth, even though there are detailed
quantitative differences, qualitatively speaking, the
performance of System A is essentially the same for both
types of excitation, which is consistent with our observations
in Chapter VII.

Figure 8.2 is essentially the same as Figure 8.1,
except that LMSE is plotted to determine
the performance of System B. The three systems plotted
are B-2, B-5 and B-10 listed in Table 8.1. From Figure
8.2, the following points are noted. First, System B
is capable of performing better than the conventional
linear prediction analysis for a wide range of S/N ratios.
Second, System B performs better after more iterations
are carried out. Therefore, it appears that the converging
solution is the optimum under the objective criterion.
This is consistent with our observations in Chapter VII.
It also appears that the results after ten iterations are
reasonably close to the converging solution. In Figure
8.2(a), a point (x) is plotted at the S/N ratio of 10 dB
after 20 iterations, and it is slightly better than the
results after 10 iterations. Third, System B degrades
quickly below 0 dB of S/N ratio and eventually performs
similarly to the conventional linear prediction analysis.
Therefore, -20 dB seems to be the lowest S/N ratio
at which System B shows some improvement over the conventional
linear prediction analysis. Fourth, the performance
of System B is essentially the same for both types of
excitation, which is consistent with our observations
in Chapter VII.

Figure 8.3 is essentially the same as Figure 8.1,
except that LMSE is plotted to determine
the performance of System C. The three systems plotted
are C-1, C-2 and C-3 listed in Table 8.1. From Figure
8.3, the following points are noted. First, System C
is capable of performing better than the conventional
linear prediction analysis for a wide range of S/N ratios.
Second, System C with k=2 shows a better performance than
with k=1 or 3 at the S/N ratios above -10 dB. This
result is consistent with our observations in Chapter VII.
Since k is a real number, there may be a better k
which is not an integer. To understand how much more
improvement can be made by a different choice of k, LMSE
at the S/N ratio of 10 dB was computed for k between 1.0 and 3.0
sampled at twenty equally spaced points (i.e., k=1.1, 1.2,
.  .  .  .  ,  2 . 8 , 2 . 9 )  .  It  was  found  that  k=2.G  is  the  optimum 
among the 20 different values of k. Even though k has
not been varied over all its possible values at the S/N
ratios considered, it appears that other choices of
k do not significantly improve the performance of System
C. Third, System C degrades quickly below 0 dB of S/N
ratio and eventually performs worse than the conventional
linear prediction analysis. Therefore, -10 dB seems to
be the lowest S/N ratio at which System C with k=2
shows some improvement over the conventional linear prediction
analysis. Fourth, the performance of System C is
essentially the same for both types of excitation, which
is consistent with our observations in Chapter VII.

Figure 8.4 shows the results of cases A-2,
B-10 and C-2, which seem to be approximately the best that
can be achieved by the three systems. The case of C-0
is also shown to facilitate the comparison with the
conventional linear prediction analysis. Figure 8.5 is
equivalent to Figure 8.4 except that Normalized LMSE
is plotted instead of LMSE. From Figures 8.4 and 8.5,
the following points are noted. First, below the S/N ratio
of -20 dB, none of the three systems performs better
than the conventional linear prediction analysis. Between
-20 and -10 dB of S/N ratio, System B after ten iterations
performs best. Approximately from -10 dB to 20 dB of
S/N ratio, System A after two iterations shows the best
performance. Between 20 and 40 dB of S/N ratio, System
C with k=2 performs best. However, the improvement of
System C over System A or System B is not large. Above the
S/N ratio of 40 dB, all three systems essentially
reach the same performance. This result indicates that
no one system performs best at all S/N ratios. Since
the intelligibility of speech changes from essentially
zero to near perfect in the range of S/N ratios between
-10 and 20 dB, System A after two iterations would be
most useful for various practical applications under
the objective criterion.

In  Figure  8.6,  the  dotted  line  shows  the  best  that 
can  be  achieved  by  any  combination  of  the  three  systems 
discussed  in  Figures  8.4  and  8.5.  The  solid  line 
corresponds  to  the  conventional  linear  prediction  analysis. 
Therefore,  the  difference  between  the  solid  line  and  the 
dotted  line  shows  the  improvement  that  can  be  achieved 
by  any  combination  of  the  three  systems  developed  in  Chapter 
VI.  How  this  improvement  translates  to  the  improvement 
in  the  listener's  subjective  domain  is  the  topic  of  the 
next  section. 

VIII.3 Subjective Evaluation: Potential Bandwidth
Compression Systems

In this section, we discuss the performance of the
three systems as potential bandwidth compression systems
of noisy speech. When the speech model parameters are
properly coded, these systems correspond to true bandwidth
compression systems. In Section VIII.3.1, the test sentences
that have been used in all the subjective tests are listed.
In Section VIII.3.2, the speech synthesis system used in
synthesizing speech based on the all pole coefficients estimated
by various potential bandwidth compression systems is
discussed. In Section VIII.3.3, various systems are compared
with each other and, based on very informal listening, the
potential bandwidth compression system that performs best is
determined. In Section VIII.3.4, the system chosen in Section
VIII.3.3 is compared with the conventional linear prediction
analysis by fifteen listeners and the results obtained
are discussed.

VIII.3.1 Test Sentences

In all the subjective comparisons discussed in this
chapter, the following five English sentences are used:

sentence 1: They took the cross town bus.
sentence 2: That shirt seems much too long.
sentence 3: He has the bluest eyes.
sentence 4: The ball dropped from his hands.
sentence 5: Line up at the screen door.

Sentences 1, 3, and 5 are spoken by adult male speakers and
sentences 2 and 4 are spoken by adult female speakers.

VIII.3.2 Speech Analysis/Synthesis System

In the analysis of speech, the all pole coefficients are
estimated by the various systems. The gain factor g is
estimated by an energy consideration such that the synthesized
speech has an energy that is approximately equal to
$\sum_n y^2(n) - \sum_n E[d^2(n)]$. In the case of the conventional linear
prediction analysis, the gain g is obtained such that the
synthesized speech has an energy that is approximately equal
to $\sum_n y^2(n)$. The source information consists of the voicing/
unvoicing decision and the pitch period in the case of voicing.
The source information is obtained from the noise-free
speech and the same source information is used in all cases.

In the analysis, the number of all pole coefficients
p is assumed to be 10, and the analysis window used is a rectangular
window 256 points long. After each analysis, the
window is moved by 128 points, so that the current analysis
window overlaps with the previous analysis window by
128 data points. Other choices of windows such as the Hamming
window were also considered. The subjective improvements
by other choices of windows were minor in all cases.
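
A minimal sketch of the framing and of the energy target for the gain is shown below. It reads y(n) as the noisy speech and d(n) as stationary white background noise of known variance, and it floors the target at zero; the flooring, the treatment of trailing samples and the function names are assumptions made for illustration.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Rectangular analysis frames of frame_len points, advanced by hop points.
    Trailing samples that do not fill a whole frame are dropped (assumption)."""
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[m * hop : m * hop + frame_len] for m in range(n_frames)])

def target_energy(noisy_frame, noise_var):
    """sum_n y^2(n) - sum_n E[d^2(n)] for white noise of variance noise_var,
    floored at zero so that the target energy is never negative (assumption)."""
    noisy_frame = np.asarray(noisy_frame, dtype=float)
    energy = np.sum(noisy_frame ** 2) - len(noisy_frame) * noise_var
    return float(max(energy, 0.0))
```

Setting `noise_var` to zero gives the target used for the conventional linear prediction analysis, which does not account for the noise.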

In  the  speech  synthesis,  the  system  in  Figure  3.1  is 
used  to  generate  speech. 

VIII.3.3 Preliminary Comparison

The speech synthesized at three S/N ratios (i.e., 20 dB,
10 dB, 0 dB) by the various systems listed in Table 8.1 was
compared informally by a few listeners, and the
following subjective judgements were made.
VIII.3.3.1 Comparison of Systems A-1, A-2 and A-3

As the number of iterations of System A increases from
one iteration to three iterations, it has been observed that
speech sounds clearer and noise is reduced more. However,
some "musical tone" like background noise becomes more
apparent and more intense as the number of iterations increases.
It appears that such speech degradation is primarily due to
the possible incorrect estimation of the formant frequencies,
particularly at higher formants. As a reasonable compromise,
System A-2 appears to be better than either System A-1 or
System A-3.

VIII.3.3.2 Comparison of Systems B-2, B-5 and B-10

As the number of iterations of System B increases from
one iteration to ten iterations, it has been observed that
speech appears clearer and noise seems to be reduced more.
For System B, it appears that the performance of System B-10
is better than that of System B-2 or System B-5.

VIII.3.3.3 Comparison of Systems C-1, C-2 and C-3

For the S/N ratios of 10 dB and 20 dB, it appears that
the performance of System C-2 is better than that of System C-1. At
the S/N ratio of 0 dB, System C-2 appears to generate clearer
voiced sounds. However, many segments of unvoiced sounds
and the higher formants of voiced sounds essentially disappear
due to the subtraction of twice the average short time
energy spectrum of noise from the short time energy spectrum
of noisy speech when the noise level is high relative to
the signal level.

A comparison between System C-2 and C-3 indicates that
the performance of System C-2 is better than that of System C-3
at all three S/N ratios considered.

VIII.3.3.4 Comparison of Systems A-2, B-10 and C-2

At the S/N ratios of 10 and 20 dB, System A-2 appears
to generate more intelligible and higher quality speech
than System B-10 or C-2. At the S/N ratio of 0 dB, System
A-2 and B-10 perform better than System C-2. However, the
choice between System A-2 and System B-10 is difficult,
since System A-2 appears to have removed more random background
noise but generated more "musical tone" like distortion,
which is quite pronounced at this S/N ratio. Despite
this difficulty, we have chosen System A-2 to be compared
to the conventional linear prediction analysis in the speech
preference test discussed in the next section.

VIII.3.4 Evaluation of System A-2 Relative to
Conventional LPC Method

In general, a fair evaluation of either a bandwidth
compression system or speech enhancement system should be
based on many factors such as intelligibility, speech
quality, listener fatigue, etc. The main purpose of the
subjective tests in this dissertation is a preliminary
examination to determine whether or not the class
of systems developed in this thesis deserves further
research efforts in terms of improving and evaluating
them. With such a purpose in mind, we have taken a very
limited point of view and performed a speech preference
test with a small amount of test material. The test
procedures and results are discussed in this section.

VIII.3.4.1 Test Material and Procedures

The test material consists of the five English
sentences described in Section VIII.3.1. The S/N ratios
considered in the test are 0 dB, 5 dB, 10 dB, 15 dB,
and 20 dB.

Two sentences were constructed for each of the five
English sentences and five S/N ratios based on the
analysis/synthesis system discussed in Section VIII.3.2.
One of the two sentences corresponded to System A-2
and the other sentence corresponded to the conventional
linear prediction analysis. Therefore, a total of
fifty sentences were constructed.

The test consisted of three sessions: one practice
session and two main sessions, Session I and Session II.
The practice session was intended primarily to acquaint
the listeners with the test procedures. Session I
was devoted to evaluating System A-2 as a potential
bandwidth compression system and Session II was devoted
to evaluating System A-2 as a speech enhancement system.
The test materials, procedures and results of Session
II will be presented in Section VIII.4, where we discuss
System A-2 as a speech enhancement system.

Session  I  consisted  of  five  parts,  each  part 
corresponding  to  one  of  the  five  S/N  ratios.  Each  part 
consisted  of  five  trials.  Each  of  the  five  trials 
corresponded  to  one  of  the  five  English  sentences.  In 
each  trial,  two  sentences  were  presented,  one  of  which 
corresponded  to  System  A-2  and  the  other  corresponded 
to  the  conventional  linear  prediction  analysis.  The 
order  of  the  presentation  of  the  two  sentences  was 
randomized  in  each  trial. 

The listeners were asked to judge in each trial
which of the two sentences was more preferable. It
was explained to the listeners that "more preferable"
could mean "more intelligible", "of higher quality",
"less noisy", any combination of them, etc., and it
was left entirely up to each individual listener to use
his own interpretation of "more preferable". In each
trial, the listeners were able to answer in five different
discrete categories: the first sentence is definitely
more preferable, the first sentence appears to be more
preferable, no preference between the two sentences, the
second sentence appears to be more preferable, and the
second sentence is definitely more preferable. It was
emphasized in the test that the judgement in each trial
should be made as independently as possible of all the
previous trials.


VIII.3.4.2 Data Analysis and Results

Each response of a listener was converted to a
numerical value in the following manner:

 2: System A-2 is definitely more preferable
 1: System A-2 appears to be more preferable
 0: no preference
-1: The conventional LPC analysis appears to be more preferable
-2: The conventional LPC analysis is definitely more preferable

The numerical value assigned to each response was considered
to represent the preference index of System A-2, $P(S_i, L_j, R_k)$,
where $S_i$ represents the ith English sentence and thus
$1 \le i \le 5$, $L_j$ represents the jth listener and thus $1 \le j \le 15$
since fifteen listeners participated in the test, and $R_k$
represents the kth S/N ratio considered and thus $1 \le k \le 5$
(k=1 corresponding to S/N=0 dB, k=2 corresponding to
S/N=5 dB, etc.). From $P(S_i, L_j, R_k)$, $P(L_j, R_k)$ was obtained
by

$$P(L_j, R_k) = \frac{1}{5} \sum_{i=1}^{5} P(S_i, L_j, R_k) \tag{8-7}$$

From $P(L_j, R_k)$ in equation (8-7), $P_M(R_k)$ and $P_{SD}(R_k)$ were
obtained by
Therefore, a positive $P_M(R_k)$ represents the preference
of System A-2 over the conventional linear prediction
analysis averaged over the five sentences used as test
material and fifteen listeners. The highest number
possible for $P_M(R_k)$ is 2. $P_{SD}(R_k)$ is the standard
deviation of $P(L_j, R_k)$ and represents the variability
among the listeners in their responses.

$P_M(R_k)$ and $P_{SD}(R_k)$ are tabulated in Table 8.3 and
plotted in Figure 8.7. The solid line in Figure 8.7
corresponds to $P_M(R_k)$ and the difference between the solid
line and either the upper or lower dotted line corresponds
to $P_{SD}(R_k)$. Even though the test was not performed at the
S/N ratio of $-\infty$ or $+\infty$, we can deduce the results from
theoretical considerations. At the S/N ratio of $+\infty$,
System A-2 is equivalent to the conventional linear
prediction analysis and hence we would expect that $P_M$(S/N
ratio = $\infty$) = 0. At the S/N ratio of $-\infty$, the preference, if
any, does not mean much.
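
The reduction of the listener responses to a mean and a spread per S/N ratio can be sketched as follows. The array layout `responses[i, j, k]` and the function name are hypothetical, and the use of the population standard deviation (divisor equal to the number of listeners) is an assumption; the averaging over sentences follows equation (8-7).

```python
import numpy as np

def preference_statistics(responses):
    """responses[i, j, k] in {-2, -1, 0, 1, 2} for sentence i, listener j, S/N ratio k.

    Returns P(L_j, R_k), its mean over listeners, and its standard
    deviation over listeners.
    """
    responses = np.asarray(responses, dtype=float)
    p_listener = responses.mean(axis=0)  # average over the sentences
    p_mean = p_listener.mean(axis=0)     # average over the listeners
    p_sd = p_listener.std(axis=0)        # spread among listeners (ddof=0 assumed)
    return p_listener, p_mean, p_sd
```

With five sentences and fifteen listeners, `p_mean` for each S/N ratio is the average of 75 responses and cannot exceed 2.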
Table 8.3

Results of the Speech Preference Test in which System A-2
is Used as a Potential Bandwidth Compression System

S/N Ratio   $P_M(R_k)$   $P_{SD}(R_k)$
0 dB        1.413        0.529
5 dB        1.387        0.481
10 dB       1.040        0.662
15 dB       1.600        0.343
20 dB       1.293        0.473

Figure 8.7 (a) Results of the speech preference test in which
System A-2 is used as a potential bandwidth compression
system. The solid line represents $P_M(R_k)$, and the distance
between the solid line and the dotted line represents $P_{SD}(R_k)$.

VIII.3.4.3 Discussions

From the results in Figure 8.7 it is clear that
System A-2 is preferred over the conventional linear
prediction analysis at all five S/N ratios that
have been considered. We conclude that these results
are sufficiently encouraging to warrant further research
efforts in improving and evaluating the class of systems
developed in this dissertation.

VIII.4 Subjective Evaluation: Speech Enhancement Systems

As was discussed before, the systems that we developed
in Chapter V and Chapter VI can be used not only as
bandwidth compression systems but also as speech enhancement
systems. There are two ways that the systems
developed in this thesis can be used for speech enhancement.
One of them is to use the estimated speech $\hat{s}_w(n)$
as enhanced speech. An alternative way is to use the
analysis/synthesis system as a speech enhancement system.
Since a complete analysis/synthesis system requires
the estimation of source information, the evaluation of
the systems as speech enhancement systems in this section
is restricted to the case in which the estimated speech
$\hat{s}_w(n)$ is used as enhanced speech. Some discussions on
using a complete analysis/synthesis system for speech
enhancement are given in Section VIII.5.

In Section VIII.4.1, the speech enhancement systems
that have been used for evaluation are specified. In
Section VIII.4.2, we remark briefly on the relative
performance of various systems listed in Table 8.1 as
speech enhancement systems. In Section VIII.4.3, the
performance of System A-2 is evaluated by a speech
preference test.

VIII.4.1 Speech Enhancement Systems

The speech enhancement systems are based on the
estimated $\hat{s}_w(n)$. In System A, $\hat{s}_w(n)$ is obtained in
Step 2. In System B, $\hat{s}_w(n)$ is obtained in Step 2B. In
System C, $\hat{s}_w(n)$ is obtained in Step 2. The analysis is
again based on a tenth order all pole system with a 10 kHz
sampling rate. In the analysis, a triangular window of
400 points was used with a frame rate of 200 points per
frame. The estimated $\hat{s}_w(n)$ is added back together in
the same way it has been analyzed as is shown in Figure
8.8.
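
The segmentation and reconstruction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the window is taken to be a symmetric triangle whose copies, shifted by 200 points, sum to one; the function names and the test signal are hypothetical; and the per-frame processing is left as the identity where an enhancement system would substitute its estimate $\hat{s}_w(n)$.

```python
def triangular_window(length):
    """Triangle that rises from 0 to 1 at the midpoint and falls back (assumed shape)."""
    half = length / 2
    return [1.0 - abs(n - half) / half for n in range(length)]


def analyze(signal, window, hop):
    """Cut the signal into windowed frames that start every `hop` samples."""
    frames = []
    for start in range(0, len(signal) - len(window) + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(len(window))])
    return frames


def overlap_add(frames, hop, total_length):
    """Add the frames back at the positions they were taken from."""
    out = [0.0] * total_length
    for index, frame in enumerate(frames):
        start = index * hop
        for n, value in enumerate(frame):
            out[start + n] += value
    return out


# Hypothetical test signal; 400 point window, 200 points per frame as in the text.
window = triangular_window(400)
signal = [float((7 * n) % 13) for n in range(2000)]
frames = analyze(signal, window, 200)
rebuilt = overlap_add(frames, 200, len(signal))
```

With this window the two frames covering any interior sample have weights that sum to one, so unprocessed frames add back to the original signal except in the first and last 200 points, which are covered by a single frame.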

VIII.4.2 Preliminary Comparison

The differences in performance among various speech
enhancement systems are very similar to the differences
in performance among various potential bandwidth
compression systems discussed in Section VIII.3.3. Therefore,
the discussions in Section VIII.3.3 also apply to
the three systems as speech enhancement systems.


[Figure 8.8 Data segmentation for the analysis and construction
of speech in a speech enhancement system based on System A-2.
Panel label: Output for Speech Construction]


VIII.4.3 Evaluation of System A-2 as a Speech Enhancement System

All aspects of the evaluation of System A-2 as a
speech enhancement system are identical to its evaluation
as a potential bandwidth compression system discussed
in Section VIII.3, with the following two differences. One
difference is that the comparison was made between noisy
speech and speech enhanced by System A-2 rather than
between speech synthesized by the conventional LPC
method and by System A-2. Another difference is that System
A-2 as a bandwidth compression system was evaluated in
Session I, as was discussed in Section VIII.3, while
System A-2 as a speech enhancement system was evaluated
in Session II of the speech preference test. The
responses obtained in Session II of the speech preference
test were analyzed in the same manner as those obtained
in Session I. To differentiate the results of Session II
from Session I, we use $Q(S_i, L_j, R_k)$, $Q(L_j, R_k)$, $Q_M(R_k)$,
$Q_{SD}(R_k)$ in place of $P(S_i, L_j, R_k)$, $P(L_j, R_k)$, $P_M(R_k)$,
$P_{SD}(R_k)$ to denote the preference index obtained from the
responses in Session II. Therefore, $Q(S_i, L_j, R_k)$ denotes
the preference index as a function of the ith English
sentence, jth listener and kth S/N ratio. The equations
parallel to equations (8-7) and (8-8) are


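$$Q(L_j, R_k) = \frac{1}{5} \sum_{i=1}^{5} Q(S_i, L_j, R_k) \qquad \text{(8-9a)}$$

$$Q_M(R_k) = \frac{1}{15} \sum_{j=1}^{15} Q(L_j, R_k) \qquad \text{(8-9b)}$$

$$Q_{SD}(R_k) = \operatorname{std}_{j}\,[\,Q(L_j, R_k)\,] \qquad \text{(8-9c)}$$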


Like $P_M(R_k)$, a positive $Q_M(R_k)$ represents the preference
of enhanced speech by System A-2 over the noisy speech
averaged over the five sentences used as test material
and fifteen listeners. The highest value possible for
$Q_M(R_k)$ is 2. $Q_{SD}(R_k)$ is the standard deviation of
$Q(L_j, R_k)$ and represents the variability among the listeners
in their responses.
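
The averaging behind these two quantities can be sketched as follows. This is a hedged illustration rather than the analysis actually used: the function names and the response values are hypothetical, and the population form of the standard deviation is an assumption, since the normalization is not recoverable from the text.

```python
import statistics


def listener_scores(responses):
    """Average each listener's preference index over the sentences.

    responses[i][j] is the index for sentence i and listener j at one S/N ratio.
    """
    num_sentences = len(responses)
    num_listeners = len(responses[0])
    return [
        sum(responses[i][j] for i in range(num_sentences)) / num_sentences
        for j in range(num_listeners)
    ]


def mean_and_spread(responses):
    """Return (mean over listeners, standard deviation over listeners)."""
    per_listener = listener_scores(responses)
    q_m = sum(per_listener) / len(per_listener)
    # Assumption: population standard deviation (divide by the number of listeners).
    q_sd = statistics.pstdev(per_listener)
    return q_m, q_sd


# Hypothetical responses: 2 sentences by 3 listeners, values between -2 and 2.
example = [[2, 0, -1],
           [1, 0, -2]]
q_m, q_sd = mean_and_spread(example)
```

In this made-up example one listener strongly prefers the processed speech and another strongly prefers the unprocessed speech, so the mean is zero while the spread is large.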

$Q_M(R_k)$ and $Q_{SD}(R_k)$ are tabulated in Table 8.4 and
plotted in Figure 8.9. The solid line in Figure 8.9
corresponds to $Q_M(R_k)$ and the difference between the solid
line and either the upper or lower dotted line corresponds
to $Q_{SD}(R_k)$. For the same reasons discussed in Section
VIII.3, $Q_M(S/N = \infty)$ would be zero and $Q_M(S/N = -\infty)$ does not
mean much.

Unlike the results of System A-2 as a potential
bandwidth compression system, enhanced speech processed
by System A-2 is preferred only at relatively high S/N
ratios. At lower S/N ratios, the "musical tone" like
background noise which arises primarily from the discontinuities
of the upper formant frequencies in a frame by
frame analysis scheme is sufficiently noticeable that the
Table 8.4

Results of the Speech Preference Test in Which
System A-2 is Used as a Speech Enhancement System

S/N Ratio     Q_M(R_k)     Q_SD(R_k)

 0 dB         -0.240       1.079
 5 dB          0.240       1.023
10 dB          0.293       0.867
15 dB          0.467       1.042
20 dB          0.747       0.728

[Figure 8.9 Results of the speech preference test in which
System A-2 is used as a speech enhancement system. The solid
line represents $Q_M(R_k)$, and the distance between the solid
line and the dotted line represents $Q_{SD}(R_k)$. Horizontal
axis: S/N, 0 to 20 dB.]

noise  reduction  by  System  A-2  does  not  sufficiently 
offset  the  speech  degradation  for  some  listeners.  The 
responses  of  the  listeners  also  indicate  that  some 
listeners  have  strong  preference  for  processed  speech 
while  some  other  listeners  have  strong  preference  for 
unprocessed  noisy  speech.  This  is  reflected  by  the 
large  standard  deviation  shown  in  Figure  8.9. 

In  the  context  of  this  thesis,  there  are  several 
methods  that  may  be  used  to  eliminate  or  mask  the 
"musical  tone"  like  background  noise  and  they  will  be 
discussed  in  Chapter  IX  where  various  improvements  are 
suggested  for  the  class  of  systems  developed  in  this  thesis. 

VIII.5 Additional Studies

VIII.5.1 Speech Enhancement by a Complete Analysis/Synthesis System

In the context of the work in this thesis, speech
enhancement may be achieved by a complete analysis/synthesis
system. To consider the feasibility of such a scheme,
the speech material synthesized based on System A-2 in
Section VIII.3 was compared with the enhanced speech
obtained in Section VIII.4. Above the S/N ratio of about
10 dB, the enhanced speech in Section VIII.4 appeared to
sound better, while below the S/N ratio of about 10 dB
the opposite appeared to be true. It is difficult to
interpret this result for several reasons. The source
information used in the synthesis of speech in Section
VIII.3 was obtained from noise-free speech. Such
accurate source information is not available in practice.
On the other hand, the source model (random noise or a
train of pulses) used is a very simplified one and a
more sophisticated excitation source such as voice
excitation may improve the quality/intelligibility of
the synthesized speech. Without further study in this
area, the informal listening results suggest that with
the simple source model and System A-2, the approach
to use the estimated $\hat{s}_w(n)$ as enhanced speech is better
than the approach to use an LPC analysis/synthesis
scheme above the S/N ratio of 10 dB.

VIII.5.2 System A-2 as a Pre-processor for Other Bandwidth Compression Systems

As has been discussed in Chapter III, the fact that
$s_0$ is estimated in addition to a is important in the
context of bandwidth compression of noisy speech as
well as speech enhancement. This is because if we
estimate only a, then we are limited to a class of vocoding
systems known as "LPC" vocoders.

As an example of using the class of systems developed
in this dissertation as pre-processors for other vocoding
systems, enhanced speech by System A-2 was processed by a
real time channel vocoder at Lincoln Laboratories and was
compared to speech processed by the same vocoder with
the unprocessed noisy speech as input. Based on informal
listening, it appears that the improvement made by
System A-2 is comparable to the improvement discussed
in Section VIII.3 where System A-2 as a potential bandwidth
compression system was compared to the conventional
linear prediction analysis.

VIII.6 Summary

In this chapter, the three systems developed in
Chapter VI have been evaluated under both objective
and subjective criteria. Under the objective criterion
with the selection of the test material discussed in
Section VIII.2, we conclude that all three systems
developed in Chapter VI, with a proper choice of
the parameters, perform better than the conventional linear
prediction analysis above -10 dB of the S/N ratio. Below
-20 dB of the S/N ratio, none of the three systems performs
any better than the conventional linear prediction
analysis. Among the class of systems implemented in
this dissertation, System A after two iterations performs
best under the objective criterion at various S/N ratios
of practical interest.

As a preliminary examination to determine whether or
not the class of systems developed in this thesis has
potential to be used as bandwidth compression and speech
enhancement systems of noisy speech, System A has been
evaluated by a speech preference test. The results of
the test indicate that System A is clearly preferred
over the conventional linear prediction analysis as
a potential bandwidth compression system. In the context
of using System A as a speech enhancement system, the
results are not as positive. However, there are a number
of improvements that can be made as we will discuss in
Chapter IX. Based on the evaluation performed in this
chapter, we conclude that the results obtained are
sufficiently encouraging to invest further research
efforts in improving and evaluating the class of systems
developed in this dissertation.



CHAPTER  IX  FUTURE  RESEARCH 
IX.1 Introduction

In this chapter, we discuss a number of areas for
future research that are related to this dissertation.
The areas of future research can be broadly classified
into three different categories. The first category
is improving the systems implemented in this thesis and
is discussed in Section IX.2. The second category is
issues related to adapting the systems to real world
situations and is discussed in Section IX.3. The third
category is the theoretical issues and systems of
theoretical interest and is discussed in Section IX.4.

IX.2 Improvements

A serious attempt has not been made in this
dissertation to improve the performance of the systems
implemented. A few simple modifications
may improve the performance of the systems developed.
In this section, such modifications are discussed.

To indicate some potential areas in which some
improvement can be made, three spectrograms are shown
in Figures 9.1, 9.2 and 9.3. Figure 9.1 represents the
spectrogram of noise-free speech that corresponds to
"Line up at the screen door". Figure 9.2 represents
the spectrogram of synthesized speech by the conventional
LPC method at the S/N ratio of 0 dB. Figure 9.3 represents
the spectrogram of the synthesized speech by System A-2
as a potential bandwidth compression system at the S/N
ratio of 0 dB. Comparing Figures 9.1 and 9.3, it is
clear that there are at least two main problems that
cause speech degradation in the process of reducing the
background noise. One of them is the non-smooth formant
transitions. This problem occurs when the formant
frequencies of speech change relatively fast and the
frame rate is low in a frame by frame analysis environment.
Such a problem may cause some speech degradation and can
be solved by a higher frame rate with some optimization
of the analysis window length, window type, or an interpolation
scheme between frames in the synthesis. The
second problem, which is more serious, arises due to the
errors made by System A in estimating the formant frequencies.
Such errors cause discontinuities in the formant
frequencies, and occur more often in the higher formants
where the local S/N ratio is relatively low. Such formant
discontinuities are probably the primary cause of the
"musical tone" like background noise discussed in Chapter
VIII. In the remainder of this section, several ways
that may solve or reduce the effect of the formant
discontinuity problem are discussed.

IX.2.1 Incorporation of A Priori Information

In the theoretical results developed in this dissertation,
it is possible to incorporate a priori information
about a. One potential source from which some a priori
information can be obtained is the nearby analysis
frames. Since the human vocal tract cannot move
arbitrarily fast, the results of one analysis frame are
in some sense correlated with the results of the
next analysis frame except at rapid onset or change.
One way to incorporate the results of the past analysis
frames in the analysis of the current frame is to
determine p(a), the a priori density of a, in terms of
the results of the previous analysis frame.

Some very preliminary experiment in which p(a) is
assumed to be $N(\bar{a}, P_0)$, where $\bar{a}$ is the estimated a in
the previous analysis frame and $P_0$ is $\sigma^2 \cdot I$ for some $\sigma^2$,
indicates that adding some a priori information from
the previous analysis frame to the current analysis
frame can reduce the "musical tone" like background
noise. Some optimization in the choice of $\bar{a}$ and $P_0$ may
lead to some noticeable improvement.
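
The effect of such a Gaussian prior can be illustrated with a generic regularized least-squares sketch. This is not the algorithm of this dissertation: it assumes, purely for illustration, that the estimate without a prior solves a set of linear equations R a = r, and that the prior adds a penalty pulling a toward the previous frame's estimate. The names `solve`, `map_with_previous_frame` and `lam` (a hypothetical weight that grows as the prior variance shrinks) are all assumptions.

```python
def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def map_with_previous_frame(R, r, a_prev, lam):
    """Solve (R + lam*I) a = r + lam*a_prev, a quadratic cost plus Gaussian prior."""
    n = len(r)
    lhs = [[R[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    rhs = [r[i] + lam * a_prev[i] for i in range(n)]
    return solve(lhs, rhs)


# Hypothetical second order example.
R = [[1.0, 0.5], [0.5, 1.0]]
r = [0.9, 0.6]
a_prev = [0.5, 0.1]
no_prior = map_with_previous_frame(R, r, a_prev, 0.0)
strong_prior = map_with_previous_frame(R, r, a_prev, 1e9)
```

With `lam` equal to zero the previous frame is ignored, and as `lam` grows the estimate approaches the previous frame's coefficients; intermediate values trade responsiveness for frame-to-frame smoothness.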

IX.2.2 Smoothing Formant Frequencies

One effect of adding some a priori information in
a manner discussed in Section IX.2.1 is smoothing the
estimated all pole coefficients a of the individual
analysis frames. Even though such a method to some extent
leads to indirectly smoothing the formant frequencies and
thus eliminating the formant discontinuities, a more
direct way would be to smooth the formant frequencies
themselves. Such a direct procedure has the additional
advantage that the formant frequencies can be smoothed
discriminately. More specifically, in the white background
noise environment the upper formant frequencies
are degraded more often than the lower formant frequencies
and therefore it may be desirable to smooth only the
upper formant frequencies.

Such a smoothing procedure can eliminate the discontinuities
in the formant frequencies and thus may reduce
the "musical tone" like background noise. Furthermore,
when the S/N ratio is relatively high such that the
errors in the estimation of the formant frequencies do
not occur often, the smoothed formant frequencies can in
fact correspond to the true formant frequencies.
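
One way such selective smoothing might look is sketched below. The choice of a short median filter, the function names, and the formant values are all assumptions made for illustration; the text does not specify a particular smoother.

```python
import statistics


def median_smooth(track, width=3):
    """Median filter across frames; the window is truncated at the ends."""
    half = width // 2
    smoothed = []
    for n in range(len(track)):
        lo = max(0, n - half)
        hi = min(len(track), n + half + 1)
        smoothed.append(statistics.median(track[lo:hi]))
    return smoothed


def smooth_upper_formants(tracks, first_smoothed=2, width=3):
    """Smooth only formant tracks with index >= first_smoothed (0 is the first formant)."""
    return [
        median_smooth(track, width) if k >= first_smoothed else list(track)
        for k, track in enumerate(tracks)
    ]


# Hypothetical formant tracks in Hz over five frames; the third formant
# contains one isolated estimation error (3400 Hz).
tracks = [
    [500, 510, 520, 530, 540],
    [1500, 1490, 1510, 1500, 1505],
    [2500, 2510, 3400, 2520, 2530],
]
smoothed = smooth_upper_formants(tracks)
```

A median filter removes an isolated outlier without blurring the neighboring frames, which is why it is used here; the lower formant tracks are passed through unchanged.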

IX.2.3 Masking with Random Noise

As we discussed in Section II.2.6, in a recent
study, Schwartz et al. [19] considered a system which
is a modification of System C discussed in this thesis
for speech enhancement. In the process of eliminating
the effect of the background noise, System C creates some
artificial speech degradation. Schwartz et al. hypothesized
that such a degradation arises due to setting the
estimate of $|S_w(\omega)|^2$ to zero when $|Y_w(\omega)|^2$ is less than
$k \cdot E[|D(\omega)|^2]$. Therefore, in their speech enhancement
system, in place of zero, $|\hat{S}_w(\omega)|^2$ was set to
$\beta \cdot E[|D(\omega)|^2]$ for a very small value of $\beta$ if $|Y_w(\omega)|^2$
is less than $(k + \beta) \cdot E[|D(\omega)|^2]$. When such a modification
is made, Schwartz et al. found that some speech
degradation due to processing which is uncomfortable
to listen to disappeared.
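
The thresholding rule can be written out for a single frequency as a short sketch. The floor branch follows the description above; the other branch, which subtracts k times the expected noise power, is an assumption about the unmodified estimate, and the function name and default values are hypothetical.

```python
def floored_power_estimate(noisy_power, noise_power, k=1.0, beta=0.01):
    """Estimate |S_w|^2 at one frequency with a small floor instead of zero.

    noisy_power is |Y_w|^2 and noise_power is E[|D|^2] at that frequency.
    """
    if noisy_power < (k + beta) * noise_power:
        # Floor branch from the text: use beta * E[|D|^2] in place of zero.
        return beta * noise_power
    # Assumed subtraction branch: remove k times the expected noise power.
    return noisy_power - k * noise_power


low = floored_power_estimate(0.5, 1.0)    # below the threshold, floored
high = floored_power_estimate(3.0, 1.0)   # above the threshold, subtracted
```

Under this assumption the two branches agree at the threshold, since $(k + \beta) \cdot E[|D(\omega)|^2] - k \cdot E[|D(\omega)|^2] = \beta \cdot E[|D(\omega)|^2]$, so the estimate never drops abruptly to zero.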

One explanation for why such a thresholding method can
reduce some perceptually undesirable speech degradation
is that it is a way of masking the speech degradation.
Based on this explanation, then, an alternative way to
mask the speech degradation which is easier to implement
than the threshold method is to simply add some random
noise to the processed speech. The concept of masking
the artificial speech degradation is not limited to System
C but can be applied to any system which generates some
perceptually undesirable speech degradation. The amount
of noise necessary to mask the speech degradation depends
on the level of the speech degradation that is to be
masked. In a very preliminary experiment, some white random
noise was added to the speech processed by System A.
A reasonable level of noise added to mask
the "musical tone" like background noise is about 15 dB
below the original background noise level. If the processing
suggested in Sections IX.2.1 or IX.2.2 is carried out
successfully and thus reduces the level of the "musical tone"
like background noise, then it is expected that random
noise at a level even lower than 15 dB below the original
noise level may be able to mask the perceptually unpleasant
speech degradation due to processing by System A-2.
Further, if the speech degradation occurs primarily in
the higher frequency regions in which the local S/N
ratio is relatively low, then adding high pass filtered
noise may be more desirable. A further study should be
carried out in determining the proper noise level and the
type of noise necessary to mask the speech degradation
that occurs by processing noisy speech with the class of
systems developed in this dissertation.
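
The level arithmetic for the masking noise can be sketched as follows. This is an illustration only: the function names are hypothetical, the noise is drawn from Python's pseudo-random Gaussian generator, and the original background noise power is assumed to be known.

```python
import random


def masking_noise(original_noise_power, level_db_below, length, seed=0):
    """White Gaussian noise whose power is level_db_below dB under the original noise."""
    target_power = original_noise_power * 10.0 ** (-level_db_below / 10.0)
    sigma = target_power ** 0.5
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(length)]


def add_masking_noise(processed, original_noise_power, level_db_below=15.0, seed=0):
    """Add the masking noise sample by sample to the processed speech."""
    noise = masking_noise(original_noise_power, level_db_below, len(processed), seed)
    return [p + n for p, n in zip(processed, noise)]


# Hypothetical check: with unit original noise power, 15 dB below is about 0.0316.
noise = masking_noise(1.0, 15.0, 20000)
measured_power = sum(x * x for x in noise) / len(noise)
masked = add_masking_noise([0.0] * 100, 1.0)
```

A 15 dB reduction corresponds to a power ratio of $10^{-1.5}$, or about 3 percent of the original noise power. High pass filtered noise, as suggested above, would need an additional filtering step that is not shown here.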

IX.3 Adaptation to Practical Problems

There  are  many  issues  which  require  further  study 
in  implementing  the  class  of  systems  developed  in  this 
thesis  in  practical  environments.  In  this  section,  we 
discuss  some  of  these  issues. 


IX.3.1 Estimation of $P_d(\omega)$

In the systems discussed in this thesis, the power
spectrum of the background noise $P_d(\omega)$ is assumed to be
known. In practice, $P_d(\omega)$ has to be estimated from the
noisy speech y(n). If the silence intervals are to be
used for the estimation of $P_d(\omega)$, a silence detector operating on the
noisy speech has to be incorporated in the overall system.
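
A rough sketch of estimating the noise power spectrum from silence intervals is given below. It rests on several assumptions that the text does not specify: silence is detected with a simple frame energy threshold, the spectrum is estimated by averaging periodograms with a 1/M normalization, and all names are hypothetical.

```python
import cmath


def periodogram(frame):
    """|DFT|^2 / M for one frame, computed directly (adequate for short frames)."""
    m = len(frame)
    spectrum = []
    for k in range(m):
        acc = sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / m) for n in range(m))
        spectrum.append(abs(acc) ** 2 / m)
    return spectrum


def estimate_noise_psd(frames, energy_threshold):
    """Average the periodograms of frames whose mean energy is below the threshold."""
    silent = [f for f in frames if sum(x * x for x in f) / len(f) < energy_threshold]
    if not silent:
        raise ValueError("no frame fell below the silence threshold")
    spectra = [periodogram(f) for f in silent]
    m = len(spectra[0])
    return [sum(s[k] for s in spectra) / len(spectra) for k in range(m)]


# Hypothetical frames: one quiet frame (treated as noise only) and one loud frame.
frames = [[1.0, -1.0, 1.0, -1.0], [10.0, 10.0, 10.0, 10.0]]
noise_psd = estimate_noise_psd(frames, energy_threshold=2.0)
```

An energy threshold is the simplest possible silence detector and would misclassify low energy speech at low S/N ratios, which is one reason the sensitivity to errors in the estimate matters.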
A related study to the estimation of $P_d(\omega)$ is to
determine the sensitivity of the performance of the systems
developed in this thesis to a possible incorrect estimation
of $P_d(\omega)$. A system which performs well when $P_d(\omega)$
is correctly estimated may degrade quickly as the estimated
$P_d(\omega)$ differs from the true $P_d(\omega)$. The sensitivity
issue is an important area to be investigated.

IX.3.2 Estimation of Source Information

To develop a complete analysis/synthesis system
based on the theoretical results developed in this
thesis, it is necessary to develop an algorithm that
estimates the source parameters. In the context of this
dissertation, we may simply apply existing pitch detectors
[40,41,42] to the estimated $\hat{s}_w(n)$. Alternatively, there
may be a better way of obtaining the source information
that accounts for the presence of background
noise. The estimation of the source parameters from the
noisy speech is an important area for future research
in developing a complete analysis/synthesis system.

IX.3.3 Evaluation of Systems

After some further study on the system improvement,
it is important to evaluate the systems in terms of their
performance in improving speech intelligibility, quality,
etc. The choice of the system may depend on the specific
background noise environment, cost of implementation, etc.


IX.4 Further Theoretical Study and Related Work

IX.4.1 Implementation of Other Systems

In this dissertation, we considered estimating a
by maximizing $p(a \mid y_0)$. Since maximizing $p(a \mid y_0)$ is a nonlinear
problem, we considered "sub-optimal" procedures
in which $p(a, s_0 \mid y_0)$ is maximized. An attempt to maximize
$p(a, s_0 \mid y_0)$ led to the LMAP and RLMAP algorithms which
require solving only sets of linear equations in an
iterative manner. Further approximations of these algorithms
led to System A and System B which were implemented.

An important area of future research from a theoretical
point of view is a theoretical understanding of the relations
and properties of the MAP, LMAP and RLMAP algorithms, and
their implementations. As we discussed in Section V.6,
a theoretical study to understand the relations and
properties of the three algorithms is currently in progress.
The implementation of the MAP algorithm is important since
the results obtained by maximizing $p(a \mid y_0)$ are the optimum
that can be achieved if we follow the philosophy that is
taken in this research. The implementation of the LMAP
and RLMAP algorithms is important since it allows us to
understand the performance degradation due to the approximations
made in developing System A and System B from the
LMAP and RLMAP algorithms. It also allows us to understand
the effect of changing the problem from maximizing
$p(a \mid y_0)$ to $p(a, s_0 \mid y_0)$. A comparison of the MAP, LMAP,
RLMAP methods, System A and System B in terms of their
performances can be a basis for determining the extent
of further research efforts in developing a different
approximation method to the true MAP estimation procedure.

IX.4.2 Different Initial Estimates of a

In the LMAP and RLMAP algorithms, System A and System B,
we begin from some initial estimate of a. In the systems
that were implemented, the initial estimate was obtained
by simply applying the correlation method of linear
prediction analysis to the noisy speech. Since the LMAP
and RLMAP algorithms are not guaranteed to give the
global maximum of $p(a \mid y_0)$ or $p(a, s_0 \mid y_0)$, other initial
estimates of a may lead to different estimates of a.

Beginning from other initial estimates of a can be
useful in at least two different ways. First, they may
lead to better estimates of a. Second, the primary
disadvantage of System B relative to System A is its
slow convergence to a reasonable solution. If we begin
from some other initial estimates of a, System B may
converge to a solution more quickly. This is an area for
further study.
IX.4.3 Incorporation of A Priori Information

There  are  many  levels  in  incorporating  a  priori 
information  based  on  the  knowledge  that  the  noisy  signal 
we  deal  with  is  speech  plus  noise.  In  one  extreme,  we 
could  add  some  a  priori  information  in  a  manner  similar 
to  the  discussions  in  Section  IX. 3.  In  the  other  extreme 
we  may  want  to  capitalize  more  fully  on  the  physiological 
constraints  imposed  by  the  human  vocal  mechanism  and 
even  the  linguistic  constraints  imposed  by  the  rules 
of  the  language.  Since  any  accurate  extra  information 
added  in  estimating  the  speech  parameters  can  potentially 
lead  to  a  better  estimate,  such  additional  knowledge  may 
be  important  in  dealing  with  the  noisy  speech.  To 
understand  what  knowledge  of  speech  we  can  capitalize 
on  and  how  such  knowledge  can  be  used  to  estimate  the 
speech  parameters  better  is  an  important  area  for  future 
research  in  many  areas  of  speech  processing. 

IX.4.4 Excitation by a Train of Pulses

In the theoretical development in this dissertation,
various systems were developed based on the assumption
that the excitation is white Gaussian noise and we simply
applied the same systems to both unvoiced and voiced
sounds. If we estimate the system parameters of voiced
speech based on the assumption that the excitation is a
train of pulses, then a better estimate of the speech
parameters may perhaps be obtained. Since a majority
of speech sounds are voiced and the voiced sounds are
very important in the perception of speech, an attempt
to estimate the speech parameters of voiced sounds more
accurately appears attractive. The notion to capitalize
on the periodicity of voiced sounds is also related to
the incorporation of more knowledge of speech in estimating
the speech parameters.

IX.4.5 Pole-Zero Modelling

In the theoretical development in this thesis, we
have assumed an all pole transfer function in the underlying
speech model. In a stationary background noise environment,
the low energy speech segments such as unvoiced
speech degrade more quickly due to the relatively low
S/N ratio and thus are probably an important factor in
decreasing speech intelligibility. Since unvoiced speech
can be better modelled by a pole-zero than an all pole
transfer function, the approach to use a pole-zero system
may lead to a better performance and it is an important
area for future research.
CHAPTER X CONCLUSION

In this thesis, the problem of enhancement and bandwidth
compression of noisy speech was formulated as a
parameter estimation problem, in which we attempted
to estimate the parameters of an underlying speech model
from the noisy speech based on the MAP estimation procedure.
Such an approach led to two algorithms which
require solving sets of linear equations in an iterative
manner. Some approximations of the two algorithms led
to two systems which are computationally simpler than the
two algorithms by taking advantage of a high speed FFT
algorithm.

As  a  preliminary  investigation  into  the  performance 
of  the  two  systems  developed  in  this  thesis,  the  two 
systems  were  implemented  and  applied  to  both  real  and 
synthetic  speech  data.  An  objective  and  informal  subjective 
evaluation  indicate  that  the  systems  implemented  perform 
well  as  enhancement  and  potential  bandwidth  compression 
systems  of  noisy  speech. 

A number of studies were suggested for future research
in this thesis. They include various improvements and
further evaluation of the systems implemented in this thesis,
implementation and evaluation of other systems that were developed
but not implemented in this thesis, and development
of new systems by incorporating more knowledge of
speech.
APPENDIX

In Appendix I, we summarize briefly the notations
that have been used in the thesis. In Appendix II,
a table of LCSE and Normalized LCSE which were discussed
in Section VIII.2 is shown.
APPENDIX I SUMMARY OF NOTATIONS

$a$: $(a_1, a_2, \ldots, a_p)^T$, an all pole coefficient vector; T represents transpose of a matrix

$\bar{a}$: a priori mean of a

$\hat{a}_i$: ith estimate of a

$A(\omega)$: $F[\alpha(n)]$, discrete time Fourier Transform of $\alpha(n)$

$B(\omega)$: $F[\beta(n)]$, discrete time Fourier Transform of $\beta(n)$

$d(n)$: disturbance or background noise; assumed to be generated by a Gaussian random process

$d_w(n)$: $d(n) \cdot w_s(n)$, windowed background noise




$d(n_1, n_2)$: $(d(n_1), \ldots, d(n_2))^T$

$d_0$: $d(N-1, 0)$, a vector of background noise

DFT[x(n)]: $X(k) = \sum_{n=0}^{M-1} x(n) \cdot e^{-j \frac{2\pi}{M} k n}$, M point Discrete Fourier Transform of x(n)

E[x]: expected value of x

$e_p$: error function to be minimized

F[x(n)]: $X(\omega) = \sum_{n=-\infty}^{\infty} x(n) \cdot e^{-j \omega n}$, discrete time Fourier Transform of x(n)

$F^{-1}[X(\omega)]$: $x(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(\omega) \cdot e^{j \omega n} \, d\omega$, the inverse discrete time Fourier Transform of $X(\omega)$

g: gain factor

H(z): z transform of the transfer function in the underlying speech model

IDFT[X(k)]: $x(n) = \frac{1}{M} \sum_{k=0}^{M-1} X(k) \cdot e^{j \frac{2\pi}{M} k n}$, the Inverse Discrete Fourier Transform of X(k)

k(n): Kalman filter gain

$m_s$: $(I - A)^{-1} \cdot A_I \cdot s_I$

m: mean of $s_0$ conditioned on a and $y_0$

N(A,B): Gaussian with mean of A and covariance of B

$P_0$: a priori covariance of a




$P_x(\omega)$: $\sum_{n=-\infty}^{\infty} R_x(n) \cdot e^{-j \omega n}$, the power spectral density of x(n)

$p(A_0)$: probability density function of A, or probability density function evaluated at $A = A_0$; see footnote 2

$p(A_0 \mid B_0)$: analogous to $p(A_0)$ with the conditional density function

$R_x(n)$: $E[x(k) \cdot x(k-n)]$ for a stationary signal x(n), or correlation of x(n)

$R_s$: $g^2 \cdot (I - A)^{-1} \cdot ((I - A)^{-1})^T$

s(n): signal or speech

$s_w(n)$: $s(n) \cdot w_s(n)$, windowed speech

$s(n_1, n_2)$: $(s(n_1), \ldots, s(n_2))^T$

$s_0$: $s(N-1, 0)$

$\hat{s}_i$: the ith estimate of $s_0$

$s_I$: $s(-1, -p)$

$S(\omega)$: $\sum_{n=-\infty}^{\infty} s(n) \cdot e^{-j \omega n}$, discrete time Fourier Transform of s(n)

$|S(\omega)|$: magnitude of $S(\omega)$

$\angle S(\omega)$: phase of $S(\omega)$

u(n): a pulse train or random noise excitation

u(n): an excitation vector, typically zero mean white Gaussian noise

Var[x]: variance of x

v(n): an observation vector, typically zero mean white Gaussian noise

V: covariance of $s_0$ conditioned on a and $y_0$

w(n): zero mean white Gaussian noise with unit variance

$w_s(n)$: a smooth window function

x(n): a state vector

x(-1): the initial state vector

$\hat{x}_{ML}$: Maximum Likelihood estimate of x

$\hat{x}_{MAP}$: Maximum A Posteriori estimate of x

$\hat{x}_{MMSE}$: Minimum Mean Square Error estimate of x

y(n): s(n)+d(n), noisy signal or noisy speech

$y_w(n)$: $y(n) \cdot w_s(n)$, windowed noisy speech

$y(n_1, n_2)$: $(y(n_1), \ldots, y(n_2))^T$

$y_0$: $y(N-1, 0)$

$Y(\omega)$: $\sum_{n=-\infty}^{\infty} y(n) \cdot e^{-j \omega n}$, the discrete time Fourier Transform of y(n)

$|Y(\omega)|$: magnitude of $Y(\omega)$

$\angle Y(\omega)$: phase of $Y(\omega)$

z(n): an observation vector

$\phi_x(n)$: $\sum_{k=-\infty}^{\infty} x_w(k) \cdot x_w(k-n)$, the short time correlation of x(n)

$\Phi_x(\omega)$: $F[\phi_x(n)]$
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